Unexplained Kernel Panic / Hung Task

2013-12-04 Thread ~Stack~
Greetings,

I have a test system I use for testing deployments and when I am not
using it, it runs Boinc. It is a Scientific Linux 6.4 fully updated box.
Recently (last ~3 weeks) I have started getting the same kernel panic.
Sometimes it will be multiple times in a single day and other times it
will be days before the next one (it just had a 5 day uptime). But the
kernel panic looks pretty much the same. It is a complaint about a hung
task plus information about the ext4 file system. I have run the
smartmon tool against both drives (2 drives set up in a hardware RAID
mirror) and both drives check out fine. I ran a fsck against the /
partition and everything looked fine (on this test box there are only /
and swap partitions). I even took out a drive at a time and had the same
crashes (though this could be an indicator that both drives are bad). I
am wondering if my RAID card is going bad.

When the crash happens I still have the SSH prompt, however, I can only
do basic things like navigating directories and sometimes reading files.
Writing to a file seems to hang, using tab-autocomplete will frequently
hang, running most programs (even `init 6` or `top`) will hang.

It crashed again last night, and I am kind of stumped. I would greatly
appreciate others' thoughts and input on what the problem might be.

Thanks!
~Stack~

Dec  4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked
for more than 120 seconds.
Dec  4 02:25:09 testbox kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  4 02:25:09 testbox kernel: jbd2/cciss!c0 D  0
 273  2 0x
Dec  4 02:25:09 testbox kernel: 8802142cfb30 0046
8802138b5800 1000
Dec  4 02:25:09 testbox kernel: 8802142cfaa0 81012c59
8802142cfae0 810a2431
Dec  4 02:25:09 testbox kernel: 880214157058 8802142cffd8
fb88 880214157058
Dec  4 02:25:09 testbox kernel: Call Trace:
Dec  4 02:25:09 testbox kernel: [81012c59] ? read_tsc+0x9/0x20
Dec  4 02:25:09 testbox kernel: [810a2431] ?
ktime_get_ts+0xb1/0xf0
Dec  4 02:25:09 testbox kernel: [810a2431] ?
ktime_get_ts+0xb1/0xf0
Dec  4 02:25:09 testbox kernel: [81119e10] ? sync_page+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8150e953] io_schedule+0x73/0xc0
Dec  4 02:25:09 testbox kernel: [81119e4d] sync_page+0x3d/0x50
Dec  4 02:25:09 testbox kernel: [8150f30f] __wait_on_bit+0x5f/0x90
Dec  4 02:25:09 testbox kernel: [8111a083]
wait_on_page_bit+0x73/0x80
Dec  4 02:25:09 testbox kernel: [81096de0] ?
wake_bit_function+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8112f115] ?
pagevec_lookup_tag+0x25/0x40
Dec  4 02:25:09 testbox kernel: [8111a4ab]
wait_on_page_writeback_range+0xfb/0x190
Dec  4 02:25:09 testbox kernel: [8125d42d] ? submit_bio+0x8d/0x120
Dec  4 02:25:09 testbox kernel: [8111a56f]
filemap_fdatawait+0x2f/0x40
Dec  4 02:25:09 testbox kernel: [a004de59]
jbd2_journal_commit_transaction+0x7e9/0x1500 [jbd2]
Dec  4 02:25:09 testbox kernel: [8100975d] ?
__switch_to+0x13d/0x320
Dec  4 02:25:09 testbox kernel: [81081b5b] ?
try_to_del_timer_sync+0x7b/0xe0
Dec  4 02:25:09 testbox kernel: [a0054148]
kjournald2+0xb8/0x220 [jbd2]
Dec  4 02:25:09 testbox kernel: [81096da0] ?
autoremove_wake_function+0x0/0x40
Dec  4 02:25:09 testbox kernel: [a0054090] ?
kjournald2+0x0/0x220 [jbd2]
Dec  4 02:25:09 testbox kernel: [81096a36] kthread+0x96/0xa0
Dec  4 02:25:09 testbox kernel: [8100c0ca] child_rip+0xa/0x20
Dec  4 02:25:09 testbox kernel: [810969a0] ? kthread+0x0/0xa0
Dec  4 02:25:09 testbox kernel: [8100c0c0] ? child_rip+0x0/0x20
Dec  4 02:25:09 testbox kernel: INFO: task master:1058 blocked for more
than 120 seconds.
Dec  4 02:25:09 testbox kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  4 02:25:09 testbox kernel: masterD  0
1058  1 0x0080
Dec  4 02:25:09 testbox kernel: 88021535d948 0082
88021535d8d8 81065c75
Dec  4 02:25:09 testbox kernel: 880028216700 88021396b578
880214336ad8 880028216700
Dec  4 02:25:09 testbox kernel: 88021396baf8 88021535dfd8
fb88 88021396baf8
Dec  4 02:25:09 testbox kernel: Call Trace:
Dec  4 02:25:09 testbox kernel: [81065c75] ?
enqueue_entity+0x125/0x410
Dec  4 02:25:09 testbox kernel: [810a2431] ?
ktime_get_ts+0xb1/0xf0
Dec  4 02:25:09 testbox kernel: [811b62b0] ? sync_buffer+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8150e953] io_schedule+0x73/0xc0
Dec  4 02:25:09 testbox kernel: [811b62f0] sync_buffer+0x40/0x50
Dec  4 02:25:09 testbox kernel: [8150f1ba]
__wait_on_bit_lock+0x5a/0xc0
Dec  4 02:25:09 testbox kernel: [811b62b0] ? sync_buffer+0x0/0x50
Dec  4 02:25:09 testbox kernel: [8150f298]
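
With panics that recur over several weeks, it can help to pull just the hung-task notices out of the syslog and see which tasks block each time. The following is a minimal sketch, not part of the original post: it assumes the log text is available as a string, and the function name `blocked_tasks` is made up for illustration. It collapses whitespace first because the mail client wrapped the long kernel lines.

```python
import re

# Matches the kernel's hung-task notice, for example:
# "INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def blocked_tasks(log_text):
    """Return a list of (task name, pid, timeout in seconds) tuples.

    blocked_tasks is a hypothetical helper for illustration only.
    """
    # Mail clients wrap long lines, so collapse every run of whitespace
    # (including newlines) into a single space before matching.
    flat = " ".join(log_text.split())
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_RE.finditer(flat)
    ]
```

Run against the excerpt above, this should report the `jbd2/cciss!c0d0` journal thread and the `master` process, which at least confirms that everything stuck is waiting on I/O to the same cciss volume.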

Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread Paul Robert Marino
Yup, that's a hardware problem. It may be bad firmware on the controller; I would check the firmware version first and see if there is a patch. I've seen this kind of thing with Dell OEMed RAID controllers enough over the years that that's almost always the first thing I try.

-- Sent from my HP Pre3

On Dec 4, 2013 8:21, ~Stack~ i.am.st...@gmail.com wrote:
Greetings,


Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread David Sommerseth

On 04/12/13 14:21, ~Stack~ wrote:
 Greetings,

 Dec  4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked
 for more than 120 seconds.
 Dec  4 02:25:09 testbox kernel: Call Trace:
 Dec  4 02:25:09 testbox kernel: [81012c59] ? read_tsc+0x9/0x20

This looks like some locking issue to me, triggered by something around the 
TSC timer.


This is either a buggy driver (most likely the cciss driver) or related
firmware (read the complete boot log carefully and look for firmware warnings).
Or it's a really unstable TSC clock source.  Try switching from TSC to HPET
(or, in the very worst case, acpi_pm).  See this KB for some related info:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html


But my hunch tells me it's a driver-related issue, with some bad locking.
There seem to be several filesystem operations happening on two or more CPU
cores in a certain order, which seems to trigger a deadlock.
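
For the clock source suggestion, here is a rough sketch of how the switch from TSC to HPET could be scripted. It relies on the kernel's sysfs files `current_clocksource` and `available_clocksource` under `/sys/devices/system/clocksource/clocksource0`; the function names and the `base` parameter are assumptions added for illustration so the logic can be tried against a scratch directory first. Writing to the real file needs root.

```python
from pathlib import Path

# Standard sysfs location of the clock source controls on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current clock source, list of available clock sources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def switch_clocksource(name, base=CLOCKSOURCE_DIR):
    """Select a clock source by name and return the one now in effect.

    Hypothetical helper: refuses names the kernel does not list as
    available instead of writing them blindly.
    """
    _, available = read_clocksources(base)
    if name not in available:
        raise ValueError(
            "clock source %r not offered; available: %s" % (name, ", ".join(available))
        )
    # On a real system this write requires root privileges.
    (base / "current_clocksource").write_text(name)
    return read_clocksources(base)[0]
```

A change made this way lasts only until the next reboot. To make it stick, the `clocksource=hpet` kernel boot parameter would be the usual route.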



--
kind regards,

David Sommerseth


Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread Paul Robert Marino
Well, I tend to discount the driver idea because of another problem he has involving multiple, what I think are identical, machines. Also, any problems I've ever had with the cciss driver were usually firmware related, and an update or rollback usually corrects them. Besides, based on what I've heard, this is low-budget equipment and ProLiants aren't cheap. If I had to guess, we are talking about Dells.

-- Sent from my HP Pre3

On Dec 4, 2013 18:36, David Sommerseth sl+us...@lists.topphemmelig.net wrote:
On 04/12/13 14:21, ~Stack~ wrote: Greetings,
 

Re: Unexplained Kernel Panic / Hung Task

2013-12-04 Thread Paul Robert Marino
If not, down-rev it to the same version as the one that works. It isn't hard to do with their utilities, because those of us who work in mission-critical environments have hammered it into their heads that it's an absolute requirement.

-- Sent from my HP Pre3

On Dec 4, 2013 19:12, ~Stack~ i.am.st...@gmail.com wrote:
On 12/04/2013 05:51 PM, Paul Robert Marino wrote:
 Well, I tend to discount the driver idea because of another problem he
 has involving multiple, what I think are identical, machines. Also, any
 problems I've ever had with the cciss driver were usually firmware
 related, and an update or rollback usually corrects them.
 Besides, based on what I've heard, this is low-budget equipment and
 ProLiants aren't cheap. If I had to guess, we are talking about Dells.

You are right, in that I am experiencing two different issues and the
vast majority of my test lab is older cast-away parts. The difference is
that both issues are on very different systems.

The DHCP problem is on a bunch of similar generic Dells. This particular
problem is on an HP ProLiant DL360 G4 whose twin (same hardware specs
and, thanks to Puppet, dang-near identical in terms of software)
so far has not displayed this problem.

Because the twin isn't having this problem and the problem only started
~3 weeks ago, I thought for the last few weeks it was a disk drive
problem.

I am looking up the firmware versions for this box now. I am not hopeful
that I will find a newer firmware for this old of a system though.
Still, totally worth the try! :-)

Thanks!
~Stack~
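
Since the twin DL360 G4 does not show the problem, one quick check is whether the two controllers run the same firmware. The sketch below is only an illustration. It assumes the cciss driver reports a `Firmware Version:` line in `/proc/driver/cciss/cciss0` (worth confirming on the box itself), and the helper names and sample version numbers are made up.

```python
import re

# Assumed layout: a line such as "Firmware Version: 2.76" somewhere in the
# text read from /proc/driver/cciss/cciss0 on each machine.
FIRMWARE_RE = re.compile(r"^\s*Firmware Version:\s*(\S+)", re.MULTILINE)


def firmware_version(proc_text):
    """Return the firmware version string, or None if no such line exists."""
    match = FIRMWARE_RE.search(proc_text)
    return match.group(1) if match else None


def compare_firmware(suspect_text, twin_text):
    """Describe how the failing box's firmware relates to its healthy twin."""
    suspect = firmware_version(suspect_text)
    twin = firmware_version(twin_text)
    if suspect is None or twin is None:
        return "unknown"
    if suspect == twin:
        return "match"
    return "differs: suspect=%s twin=%s" % (suspect, twin)
```

Feeding it the contents of that file from both machines gives a one-line answer. A mismatch would point toward updating, or down-revving as suggested above, to the version on the working twin.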