Re: INFO: task blocked for more than 120 seconds.

2010-09-01 Thread Winnie Lacesso
Good day,

Someone in our local LUG recently posed about this, he was seeing

 Jul 21 07:39:04 linux kernel: INFO: task mount:5258 blocked for more than 120 
 seconds.
 Jul 21 07:39:04 linux kernel: echo 0  
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Jul 21 07:39:04 linux kernel: mount D 0002 0 / 5258   
 4842 0x0080
 Jul 21 07:39:04 linux kernel: 880037b4fb28 0082 / 
 880037b4fa78 812d6ea0
 Jul 21 07:39:04 linux kernel: 880037b4fad8 812ef2f9 / 
 880037b4ffd8 880037b4ffd8

He found a solution:

 Googling around suggests that it might be a recurrent kernel bug (as
 it appears to come and go between versions) and that it might be related to
 the method the kernel uses to idle the processor. The most power-efficient
 technique is to use the MWAIT instruction, which is the default on systems
 that support it. The least power-efficient (and most performant) is use a
 polling idle loop. In between is to use the HALT instruction.
 
 Since starting to boot my kernel with
 
   idle=halt
 
 (and pcie_aspm=off, FWIW, but I don't think my system was using it anyway),
 I haven't had any recurrance.

Does that help?


Re: INFO: task blocked for more than 120 seconds.

2010-08-31 Thread Yannick Perret

Steven Timm a écrit :

Hi Doug--I have seen the same message on some of our machines but
so far it hasn't caused any real performance problems up until now.
It's not so much if you are running SL5.5 but just as long as you
are running some of the latest errata kernels.. we only
saw it show up on SL5.3 but with the latest errata kernel.


Hello,

here at CC-IN2P3 we met this problem. It appears 3 or 4 release back, 
and was never corrected in latests updates.
The problem seems to come from mainline kernel stuff (I guess backported 
with other features), as I saw some similar bug reports on mainline 
kernel lists.


We also met the task blocked messages, and / or increasing of system 
load, and deeply locked tasks that make 'ps' or similar commands to hang.


The only way we found to reduce the problem was to tune the kernel 
parameters (in particular parameters related to VM). But it only reduced 
the problem.


--
Y.




On Mon, 30 Aug 2010, Doug Johnson wrote:


Greetings,

I am seeing the following messsage when an SL5.5 (all of the most recent
updates are installed) is under load writing data to an NFS disk:

NOTE: It occurs for other processes than kswapd0, so I don't think that
has anything to do with the issue.

Aug 30 18:25:21 se kernel: INFO: task kswapd0:220 blocked for more 
than 120 seconds.
Aug 30 18:25:21 se kernel: echo 0  
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Aug 30 18:25:21 se kernel: kswapd0   D 810003336420 0   
220 36   221   219 (L-TLB)
Aug 30 18:25:21 se kernel:  810003be19e0 0046 
810037c9c200 8100ae3c4000
Aug 30 18:25:21 se kernel:  0003 000a 
810037f2a860 80308b60
Aug 30 18:25:21 se kernel:  0a919f5c3fe1 002d7d53 
810037f2aa48 c770f5f8

Aug 30 18:25:21 se kernel: Call Trace:
Aug 30 18:25:21 se kernel:  [8006e1db] 
do_gettimeofday+0x40/0x90
Aug 30 18:25:21 se kernel:  [886646e5] 
:nfs:nfs_wait_bit_uninterruptible+0x0/0xd

Aug 30 18:25:21 se kernel:  [800637ea] io_schedule+0x3f/0x67
Aug 30 18:25:21 se kernel:  [886646ee] 
:nfs:nfs_wait_bit_uninterruptible+0x9/0xd

Aug 30 18:25:21 se kernel:  [80063a16] __wait_on_bit+0x40/0x6e
Aug 30 18:25:21 se kernel:  [886646e5] 
:nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Aug 30 18:25:21 se kernel:  [80063ab0] 
out_of_line_wait_on_bit+0x6c/0x78
Aug 30 18:25:21 se kernel:  [800a0a06] 
wake_bit_function+0x0/0x23
Aug 30 18:25:21 se kernel:  [88668106] 
:nfs:nfs_wait_on_requests_locked+0x70/0xca
Aug 30 18:25:21 se kernel:  [88669146] 
:nfs:nfs_sync_inode_wait+0x60/0x1db
Aug 30 18:25:21 se kernel:  [8865f234] 
:nfs:nfs_release_page+0x2c/0x4d
Aug 30 18:25:21 se kernel:  [800caea8] 
shrink_inactive_list+0x511/0x8d8
Aug 30 18:25:21 se kernel:  [800ca39b] 
isolate_lru_pages+0x98/0xbf
Aug 30 18:25:21 se kernel:  [80047e98] 
__pagevec_release+0x19/0x22
Aug 30 18:25:21 se kernel:  [800ca876] 
shrink_active_list+0x4b4/0x4c4

Aug 30 18:25:21 se kernel:  [800130f5] shrink_zone+0x127/0x18d
Aug 30 18:25:21 se kernel:  [80057b94] kswapd+0x323/0x46c
Aug 30 18:25:21 se kernel:  [800a09d8] 
autoremove_wake_function+0x0/0x2e
Aug 30 18:25:21 se kernel:  [800a07c0] 
keventd_create_kthread+0x0/0xc4

Aug 30 18:25:21 se kernel:  [80057871] kswapd+0x0/0x46c
Aug 30 18:25:21 se kernel:  [800a07c0] 
keventd_create_kthread+0x0/0xc4

Aug 30 18:25:21 se kernel:  [8003287b] kthread+0xfe/0x132
Aug 30 18:25:21 se kernel:  [8005dfb1] child_rip+0xa/0x11
Aug 30 18:25:21 se kernel:  [800a07c0] 
keventd_create_kthread+0x0/0xc4

Aug 30 18:25:21 se kernel:  [8003277d] kthread+0x0/0x132
Aug 30 18:25:21 se kernel:  [8005dfa7] child_rip+0x0/0x11

I have seen this error with both an Intel Pro1000 and a Realtek Ethernet
card.

I am doing work with 2 other different Universities (completely
different hardware) and they have all seen this message. Prior to 5.5,
this would result in the machine locking up. Now with 5.5 it appears
that the load level on the machine slowly rises (I assume due to D wait
state blocked processes), but the machine is somewhat responsive. Also
once these messages occur, ps will hang and that session becomes
unusable.

I don't what this means, but a similarly configured machine with
identical hardware running SL4.7 does not produce these errors and the
NFS throughput is pretty darn good.

Any help or pointers in some direction will be appreciated,
Thanks,
doug

 


  Doug Johnsonemail: drj...@pizero.colorado.edu
  B390, Duane Physics (303)-492-4506 Office
  Boulder, CO 80309   (303)-492-5119 FAX
  http://www.aaccchildren.org
  Tully, baby. Look around. It's a cage with golden 

Re: INFO: task blocked for more than 120 seconds.

2010-08-30 Thread Steven Timm

Hi Doug--I have seen the same message on some of our machines but
so far it hasn't caused any real performance problems up until now.
It's not so much if you are running SL5.5 but just as long as you
are running some of the latest errata kernels.. we only
saw it show up on SL5.3 but with the latest errata kernel.

Steve



Steve


On Mon, 30 Aug 2010, Doug Johnson wrote:


Greetings,

I am seeing the following messsage when an SL5.5 (all of the most recent
updates are installed) is under load writing data to an NFS disk:

NOTE: It occurs for other processes than kswapd0, so I don't think that
has anything to do with the issue.

Aug 30 18:25:21 se kernel: INFO: task kswapd0:220 blocked for more than 120 
seconds.
Aug 30 18:25:21 se kernel: echo 0  /proc/sys/kernel/hung_task_timeout_secs 
disables this message.
Aug 30 18:25:21 se kernel: kswapd0   D 810003336420 0   220 36  
 221   219 (L-TLB)
Aug 30 18:25:21 se kernel:  810003be19e0 0046 810037c9c200 
8100ae3c4000
Aug 30 18:25:21 se kernel:  0003 000a 810037f2a860 
80308b60
Aug 30 18:25:21 se kernel:  0a919f5c3fe1 002d7d53 810037f2aa48 
c770f5f8
Aug 30 18:25:21 se kernel: Call Trace:
Aug 30 18:25:21 se kernel:  [8006e1db] do_gettimeofday+0x40/0x90
Aug 30 18:25:21 se kernel:  [886646e5] 
:nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Aug 30 18:25:21 se kernel:  [800637ea] io_schedule+0x3f/0x67
Aug 30 18:25:21 se kernel:  [886646ee] 
:nfs:nfs_wait_bit_uninterruptible+0x9/0xd
Aug 30 18:25:21 se kernel:  [80063a16] __wait_on_bit+0x40/0x6e
Aug 30 18:25:21 se kernel:  [886646e5] 
:nfs:nfs_wait_bit_uninterruptible+0x0/0xd
Aug 30 18:25:21 se kernel:  [80063ab0] 
out_of_line_wait_on_bit+0x6c/0x78
Aug 30 18:25:21 se kernel:  [800a0a06] wake_bit_function+0x0/0x23
Aug 30 18:25:21 se kernel:  [88668106] 
:nfs:nfs_wait_on_requests_locked+0x70/0xca
Aug 30 18:25:21 se kernel:  [88669146] 
:nfs:nfs_sync_inode_wait+0x60/0x1db
Aug 30 18:25:21 se kernel:  [8865f234] :nfs:nfs_release_page+0x2c/0x4d
Aug 30 18:25:21 se kernel:  [800caea8] 
shrink_inactive_list+0x511/0x8d8
Aug 30 18:25:21 se kernel:  [800ca39b] isolate_lru_pages+0x98/0xbf
Aug 30 18:25:21 se kernel:  [80047e98] __pagevec_release+0x19/0x22
Aug 30 18:25:21 se kernel:  [800ca876] shrink_active_list+0x4b4/0x4c4
Aug 30 18:25:21 se kernel:  [800130f5] shrink_zone+0x127/0x18d
Aug 30 18:25:21 se kernel:  [80057b94] kswapd+0x323/0x46c
Aug 30 18:25:21 se kernel:  [800a09d8] 
autoremove_wake_function+0x0/0x2e
Aug 30 18:25:21 se kernel:  [800a07c0] keventd_create_kthread+0x0/0xc4
Aug 30 18:25:21 se kernel:  [80057871] kswapd+0x0/0x46c
Aug 30 18:25:21 se kernel:  [800a07c0] keventd_create_kthread+0x0/0xc4
Aug 30 18:25:21 se kernel:  [8003287b] kthread+0xfe/0x132
Aug 30 18:25:21 se kernel:  [8005dfb1] child_rip+0xa/0x11
Aug 30 18:25:21 se kernel:  [800a07c0] keventd_create_kthread+0x0/0xc4
Aug 30 18:25:21 se kernel:  [8003277d] kthread+0x0/0x132
Aug 30 18:25:21 se kernel:  [8005dfa7] child_rip+0x0/0x11

I have seen this error with both an Intel Pro1000 and a Realtek Ethernet
card.

I am doing work with 2 other different Universities (completely
different hardware) and they have all seen this message. Prior to 5.5,
this would result in the machine locking up. Now with 5.5 it appears
that the load level on the machine slowly rises (I assume due to D wait
state blocked processes), but the machine is somewhat responsive. Also
once these messages occur, ps will hang and that session becomes
unusable.

I don't what this means, but a similarly configured machine with
identical hardware running SL4.7 does not produce these errors and the
NFS throughput is pretty darn good.

Any help or pointers in some direction will be appreciated,
Thanks,
doug


  Doug Johnsonemail: drj...@pizero.colorado.edu
  B390, Duane Physics (303)-492-4506 Office
  Boulder, CO 80309   (303)-492-5119 FAX
  http://www.aaccchildren.org
  Tully, baby. Look around. It's a cage with golden bars.




--
--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.