Unexplained Kernel Panic / Hung Task

2013-12-04 Thread ~Stack~

Greetings,

I have a test system I use for testing deployments and when I am not
using it, it runs BOINC. It is a Scientific Linux 6.4 fully updated box.
Recently (last ~3 weeks) I have started getting the same kernel panic.
Sometimes it will be multiple times in a single day and other times it
will be days before the next one (it just had a 5 day uptime). But the
kernel panic looks pretty much the same. It is a complaint about a hung
task plus information about the ext4 file system. I have run
smartmontools against both drives (2 drives set up in a hardware RAID
mirror) and both drives check out fine. I ran a fsck against the /
partition and everything looked fine (on this test box there are only /
and swap partitions). I even took out a drive at a time and had the same
crashes (though this could be an indicator that both drives are bad). I
am wondering if my RAID card is going bad.
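
With a hardware RAID controller in the path, a per-drive SMART check usually has to go through the `-d cciss,N` device type in smartmontools. A minimal sketch that only builds those commands, assuming the logical drive node is `/dev/cciss/c0d0` (inferred from the `jbd2/cciss!c0d0` task name in the log) and that the mirror's two physical disks sit at indexes 0 and 1; `smartctl_commands` is a hypothetical helper name:

```python
def smartctl_commands(device="/dev/cciss/c0d0", disk_count=2):
    """Build one `smartctl` health-check command per physical disk.

    `device` and `disk_count` are assumptions for this thread's box:
    one cciss logical drive backed by a two-disk mirror.
    """
    return [
        ["smartctl", "-H", "-d", f"cciss,{index}", device]
        for index in range(disk_count)
    ]


if __name__ == "__main__":
    # Print the commands rather than running them; they need root to query the disks.
    for argv in smartctl_commands():
        print(" ".join(argv))
```

If the disk indexes or device node differ on a given controller, only the two default arguments need to change.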

When the crash happens I still have the SSH prompt, however, I can only
do basic things like navigating directories and sometimes reading files.
Writing to a file seems to hang, using tab-autocomplete will frequently
hang, running most programs (even `init 6` or `top`) will hang.

It crashed again last night, and I am kind of stumped. I would greatly
appreciate others' thoughts and input on what the problem might be.

Thanks!
~Stack~

Dec  4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds.
Dec  4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  4 02:25:09 testbox kernel: jbd2/cciss!c0 D  0  273  2 0x
Dec  4 02:25:09 testbox kernel: 8802142cfb30 0046 8802138b5800 1000
Dec  4 02:25:09 testbox kernel: 8802142cfaa0 81012c59 8802142cfae0 810a2431
Dec  4 02:25:09 testbox kernel: 880214157058 8802142cffd8 fb88 880214157058
Dec  4 02:25:09 testbox kernel: Call Trace:
Dec  4 02:25:09 testbox kernel: [81012c59] ? read_tsc+0x9/0x20
Dec  4 02:25:09 testbox kernel: [810a2431] ? ktime_get_ts+0xb1/0xf0
Dec  4 02:25:09 testbox kernel: [810a2431] ? ktime_get_ts+0xb1/0xf0
Dec  4 02:25:09 testbox kernel: [81119e10] ? sync_page+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8150e953] io_schedule+0x73/0xc0
Dec  4 02:25:09 testbox kernel: [81119e4d] sync_page+0x3d/0x50
Dec  4 02:25:09 testbox kernel: [8150f30f] __wait_on_bit+0x5f/0x90
Dec  4 02:25:09 testbox kernel: [8111a083] wait_on_page_bit+0x73/0x80
Dec  4 02:25:09 testbox kernel: [81096de0] ? wake_bit_function+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8112f115] ? pagevec_lookup_tag+0x25/0x40
Dec  4 02:25:09 testbox kernel: [8111a4ab] wait_on_page_writeback_range+0xfb/0x190
Dec  4 02:25:09 testbox kernel: [8125d42d] ? submit_bio+0x8d/0x120
Dec  4 02:25:09 testbox kernel: [8111a56f] filemap_fdatawait+0x2f/0x40
Dec  4 02:25:09 testbox kernel: [a004de59] jbd2_journal_commit_transaction+0x7e9/0x1500 [jbd2]
Dec  4 02:25:09 testbox kernel: [8100975d] ? __switch_to+0x13d/0x320
Dec  4 02:25:09 testbox kernel: [81081b5b] ? try_to_del_timer_sync+0x7b/0xe0
Dec  4 02:25:09 testbox kernel: [a0054148] kjournald2+0xb8/0x220 [jbd2]
Dec  4 02:25:09 testbox kernel: [81096da0] ? autoremove_wake_function+0x0/0x40
Dec  4 02:25:09 testbox kernel: [a0054090] ? kjournald2+0x0/0x220 [jbd2]
Dec  4 02:25:09 testbox kernel: [81096a36] kthread+0x96/0xa0
Dec  4 02:25:09 testbox kernel: [8100c0ca] child_rip+0xa/0x20
Dec  4 02:25:09 testbox kernel: [810969a0] ? kthread+0x0/0xa0
Dec  4 02:25:09 testbox kernel: [8100c0c0] ? child_rip+0x0/0x20
Dec  4 02:25:09 testbox kernel: INFO: task master:1058 blocked for more than 120 seconds.
Dec  4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  4 02:25:09 testbox kernel: master D  0 1058  1 0x0080
Dec  4 02:25:09 testbox kernel: 88021535d948 0082 88021535d8d8 81065c75
Dec  4 02:25:09 testbox kernel: 880028216700 88021396b578 880214336ad8 880028216700
Dec  4 02:25:09 testbox kernel: 88021396baf8 88021535dfd8 fb88 88021396baf8
Dec  4 02:25:09 testbox kernel: Call Trace:
Dec  4 02:25:09 testbox kernel: [81065c75] ? enqueue_entity+0x125/0x410
Dec  4 02:25:09 testbox kernel: [810a2431] ? ktime_get_ts+0xb1/0xf0
Dec  4 02:25:09 testbox kernel: [811b62b0] ? sync_buffer+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8150e953] io_schedule+0x73/0xc0
Dec  4 02:25:09 testbox kernel: [811b62f0] sync_buffer+0x40/0x50
Dec  4 02:25:09 testbox kernel: [8150f1ba] __wait_on_bit_lock+0x5a/0xc0
Dec  4 02:25:09 testbox kernel: [811b62b0] ? sync_buffer+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8150f298]
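
When these reports pile up over days, it can help to pull out just the task names and PIDs to see whether the same tasks hang each time. A minimal sketch, assuming the report header always has the form shown in the log above; `hung_tasks` is a hypothetical helper name, not part of any tool:

```python
import re

# Matches the kernel's hung-task report header, e.g.
# "INFO: task master:1058 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task report in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK.finditer(flat)
    ]


if __name__ == "__main__":
    sample = (
        "Dec  4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked\n"
        "for more than 120 seconds.\n"
        "Dec  4 02:25:09 testbox kernel: INFO: task master:1058 blocked for more\n"
        "than 120 seconds.\n"
    )
    for name, pid, secs in hung_tasks(sample):
        print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

Collapsing whitespace first keeps the pattern simple, at the cost of losing the original line boundaries.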

Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread Paul Robert Marino

Yup, that's a hardware problem. It may be bad firmware on the controller. I would check the firmware version first and see if there is a patch. I've seen this kind of thing with Dell OEMed RAID controllers enough over the years that that's almost always the first thing I try.

-- Sent from my HP Pre3

Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread David Sommerseth

On 04/12/13 14:21, ~Stack~ wrote:

Dec 4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds.
Dec 4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0 273 2 0x
Dec 4 02:25:09 testbox kernel: 8802142cfb30 0046 8802138b5800 1000
Dec 4 02:25:09 testbox kernel: 8802142cfaa0 81012c59 8802142cfae0 810a2431
Dec 4 02:25:09 testbox kernel: 880214157058 8802142cffd8 fb88 880214157058
Dec 4 02:25:09 testbox kernel: Call Trace:
Dec 4 02:25:09 testbox kernel: [81012c59] ? read_tsc+0x9/0x20

This looks like some locking issue to me, triggered by something around the
TSC timer.

This is either a buggy driver (most likely the cciss driver) or related
firmware (read the complete boot log carefully and look for firmware warnings).
Or it's a really unstable TSC clock source. Try switching from TSC to HPET
(or, in the very worst case, acpi_pm). See this KB for some related info:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html
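
Before switching, it helps to see which clock source the kernel is currently using and which alternatives it offers. A minimal sketch, assuming the standard sysfs layout under `/sys/devices/system/clocksource/clocksource0`; the function names are illustrative only:

```python
from pathlib import Path

# Standard sysfs location of the kernel's clocksource controls.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources read from sysfs."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def can_switch_to(target, base=CLOCKSOURCE_DIR):
    """True if `target` (e.g. "hpet") is offered and not already in use."""
    current, available = clocksource_status(base)
    return target in available and target != current


if __name__ == "__main__":
    current, available = clocksource_status()
    print(f"current: {current}; available: {', '.join(available)}")
    print(f"hpet is a valid switch target: {can_switch_to('hpet')}")
```

The switch itself is done as root by writing the chosen name into `current_clocksource`, and it can be made persistent with the `clocksource=` kernel boot parameter; the sketch above only reads state and changes nothing.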

But my hunch tells me it's a driver-related issue, with some bad locking.
There seem to be several filesystem operations happening on two or more CPU
cores in a certain order, which seems to trigger a deadlock.

--
kind regards,

David Sommerseth

Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread Paul Robert Marino

Well, I tend to discount the driver idea because of another problem he has involving multiple machines that I think are identical. Also, any problems I've ever had with the cciss driver were usually firmware related, and an update or rollback usually corrects them.

Besides, based on what I've heard, this is low-budget equipment, and ProLiants aren't cheap. If I had to guess, we are talking about Dells.

-- Sent from my HP Pre3

Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread Paul Robert Marino

If not, down-rev it to the same version as the one that works. It isn't hard to do with their utilities, because those of us who work in mission-critical environments have hammered it into their heads that it's an absolute requirement.

-- Sent from my HP Pre3

On Dec 4, 2013 19:12, ~Stack~ i.am.st...@gmail.com wrote:

On 12/04/2013 05:51 PM, Paul Robert Marino wrote:
Well, I tend to discount the driver idea because of another problem he
has involving multiple machines that I think are identical. Also, any
problems I've ever had with the cciss driver were usually firmware
related, and an update or rollback usually corrects them.
Besides, based on what I've heard, this is low-budget equipment, and
ProLiants aren't cheap. If I had to guess, we are talking about Dells.

You are right, in that I am experiencing two different issues and the
vast majority of my test lab is older cast-away parts. The difference is
that both issues are on very different systems.

The DHCP problem is on a bunch of similar generic Dells. This particular
problem is on an HP ProLiant DL360 G4 whose twin (same hardware specs
and, thanks to Puppet, dang-near identical in terms of software)
so far has not displayed this problem.

Because the twin isn't having this problem and the problem only started
~3 weeks ago, I have thought for the last few weeks that it was a disk
drive problem.

I am looking up the firmware versions for this box now. I am not hopeful
that I will find a newer firmware for this old of a system though.
Still, totally worth the try! :-)

Thanks!
~Stack~
