[Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Michael Schwartzkopff
Hi,

my system load shows that quite a number of processes are waiting. ps shows me 
the same number of processes in state D (uninterruptible sleep). All processes 
are ll_mdt_NN, where NN is a decimal number.
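
For readers who want to reproduce the check described above (which tasks are in state D), here is a minimal Python sketch. It assumes a Linux /proc filesystem; the function names parse_stat_line and uninterruptible_tasks are illustrative only and are not part of Lustre or ps.

```python
import os

def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for every task currently in state D."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux system, or /proc is not mounted
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited or the entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return found if not found else sorted(found)

if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

On an affected MDS this would be expected to list the ll_mdt_NN threads; on a healthy system the list is usually empty or very short-lived.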

In the logs I find the following entry (see log below).

My questions are:
What causes the problem?
Can I kill the hanging processes?

System: Lustre 1.8.1 on RHEL 5.3

thanks for any hints.

---

Oct  5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 28402: it was inactive for 200.00s
Oct  5 10:28:03 sosmds2 kernel: ll_mdt_35 D 81000100c980 0 28402 1 28403 28388 (L-TLB)
Oct  5 10:28:03 sosmds2 kernel:  81041c723810 0046 7fff
Oct  5 10:28:03 sosmds2 kernel:  81041c7237d0 0001 81022f3e60c0 81022f12e080
Oct  5 10:28:03 sosmds2 kernel:  000177b2feff847c 14df 81022f3e62a8 0001028f
Oct  5 10:28:03 sosmds2 kernel: Call Trace:
Oct  5 10:28:03 sosmds2 kernel:  [8008a3ef] default_wake_function+0x0/0xe
Oct  5 10:28:03 sosmds2 kernel:  [885b1b26] :libcfs:lbug_with_loc+0xc6/0xd0
Oct  5 10:28:03 sosmds2 kernel:  [885b9c70] :libcfs:tracefile_init+0x0/0x110
Oct  5 10:28:03 sosmds2 kernel:  [88712218] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct  5 10:28:03 sosmds2 kernel:  [889ec529] :mds:mds_getattr_lock+0xc59/0xce0
Oct  5 10:28:03 sosmds2 kernel:  [88710ea4] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct  5 10:28:03 sosmds2 kernel:  [88602923] :lnet:lnet_ni_send+0x93/0xd0
Oct  5 10:28:03 sosmds2 kernel:  [88604d23] :lnet:lnet_send+0x973/0x9a0
Oct  5 10:28:03 sosmds2 kernel:  [889e6fca] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct  5 10:28:03 sosmds2 kernel:  [889f2a76] :mds:mds_intent_policy+0x636/0xc10
Oct  5 10:28:03 sosmds2 kernel:  [886d36f6] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct  5 10:28:03 sosmds2 kernel:  [886d0d46] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct  5 10:28:03 sosmds2 kernel:  [886ecacf] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct  5 10:28:03 sosmds2 kernel:  [8864fe48] :obdclass:lustre_hash_add+0x218/0x2e0
Oct  5 10:28:03 sosmds2 kernel:  [886f5530] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct  5 10:28:03 sosmds2 kernel:  [886f3669] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct  5 10:28:03 sosmds2 kernel:  [889f0630] :mds:mds_handle+0x4080/0x4cb0
Oct  5 10:28:03 sosmds2 kernel:  [885e0047] :lvfs:lprocfs_counter_sub+0x57/0x90
Oct  5 10:28:03 sosmds2 kernel:  [80148d4f] __next_cpu+0x19/0x28
Oct  5 10:28:03 sosmds2 kernel:  [88715a15] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  5 10:28:03 sosmds2 kernel:  [80089d89] enqueue_task+0x41/0x56
Oct  5 10:28:03 sosmds2 kernel:  [8871a72d] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  5 10:28:03 sosmds2 kernel:  [8871ce67] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct  5 10:28:03 sosmds2 kernel:  [8003dc3f] lock_timer_base+0x1b/0x3c
Oct  5 10:28:03 sosmds2 kernel:  [80088819] __wake_up_common+0x3e/0x68
Oct  5 10:28:03 sosmds2 kernel:  [88720908] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  5 10:28:03 sosmds2 kernel:  [8008a3ef] default_wake_function+0x0/0xe
Oct  5 10:28:03 sosmds2 kernel:  [800b48dd] audit_syscall_exit+0x327/0x342
Oct  5 10:28:03 sosmds2 kernel:  [8005dfb1] child_rip+0xa/0x11
Oct  5 10:28:03 sosmds2 kernel:  [8871f6f0] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  5 10:28:03 sosmds2 kernel:  [8005dfa7] child_rip+0x0/0x11


-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Address: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: mi...@multinet.de
web: www.multinet.de

Registered office: 85630 Grasbrunn
Commercial register: Amtsgericht München HRB 114375
Managing directors: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Brian J. Murrell
On Tue, 2009-10-06 at 12:48 +0200, Michael Schwartzkopff wrote:
> Hi,

Hi,

> my system load shows that quite a number of processes are waiting.

Blocked.  I guess the word waiting is similar.

> My questions are:
> What causes the problem?

In this case, the thread has lbugged previously.

If you look in syslog for the node with these processes you should find
entries with LBUG and/or ASSERTION messages.  These are the defects that
are causing the processes to get blocked (uninterruptible sleep).
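
As an illustration of the syslog search described above, here is a minimal Python sketch that picks out LustreError lines reporting an LBUG or a failed ASSERTION. It assumes each syslog entry sits on a single line in the form shown in this thread; the name find_lbugs and the regular expression are illustrative and not part of any Lustre tool.

```python
import re

# Matches entries of the form seen in this thread, for example:
#   ... kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
PATTERN = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:\s*"
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*"
    r"(?P<what>LBUG|ASSERTION.*)"
)

def find_lbugs(lines):
    """Return (pid, source_file, source_line, function, message) per matching line."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((
                int(match.group("pid")),
                match.group("file"),
                int(match.group("line")),
                match.group("func"),
                match.group("what").strip(),
            ))
    return hits
```

It could be fed an open log file, for example find_lbugs(open("/var/log/messages", errors="replace")), where the path is an assumption about where syslog writes on the node. The source file, line, and function it reports are the terms to search bugzilla for.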

> Can I kill the hanging processes?

Nope.  You have to reboot the node.

Please search bugzilla for the LBUG/ASSERTIONs you are getting and if
you don't find anything that matches, please file a new bug.

> Oct  5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 28402: it was inactive for 200.00s
> Oct  5 10:28:03 sosmds2 kernel: ll_mdt_35 D 81000100c980 0 28402 1 28403 28388 (L-TLB)
> Oct  5 10:28:03 sosmds2 kernel:  81041c723810 0046 7fff
> Oct  5 10:28:03 sosmds2 kernel:  81041c7237d0 0001 81022f3e60c0 81022f12e080
> Oct  5 10:28:03 sosmds2 kernel:  000177b2feff847c 14df 81022f3e62a8 0001028f
> Oct  5 10:28:03 sosmds2 kernel: Call Trace:
> Oct  5 10:28:03 sosmds2 kernel:  [8008a3ef] default_wake_function+0x0/0xe
> Oct  5 10:28:03 sosmds2 kernel:  [885b1b26] :libcfs:lbug_with_loc+0xc6/0xd0

Here's where you can see that the thread has lbugged.
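
To illustrate the point above, here is a small Python sketch that reports which watchdog-dumped threads have lbug_with_loc in their call trace. It assumes one syslog entry per line and that stack dumps are not interleaved in the log; lbugged_watchdog_pids is a hypothetical helper name, not a Lustre utility.

```python
import re

WATCHDOG = re.compile(
    r"Watchdog triggered for pid (\d+): it was inactive for ([\d.]+)s"
)

def lbugged_watchdog_pids(lines):
    """Map pid -> inactive seconds for watchdog dumps whose trace shows an LBUG."""
    result = {}
    current = None  # (pid, seconds) of the watchdog dump being read, if any
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            current = (int(match.group(1)), float(match.group(2)))
        elif "showing stack for process" in line:
            current = None  # a different kind of stack dump starts here
        elif current is not None and "lbug_with_loc" in line:
            result[current[0]] = current[1]
    return result
```

A pid reported by this function is a thread that is parked after an LBUG, as opposed to one that is merely slow.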

b.





Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Michael Schwartzkopff
On Tuesday, 6 October 2009 16:22:08, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 12:48 +0200, Michael Schwartzkopff wrote:
> > Hi,

> Hi,

> > my system load shows that quite a number of processes are waiting.

> Blocked.  I guess the word waiting is similar.

> > My questions are:
> > What causes the problem?

> In this case, the thread has lbugged previously.

> If you look in syslog for the node with these processes you should find
> entries with LBUG and/or ASSERTION messages.  These are the defects that
> are causing the processes to get blocked (uninterruptible sleep).
(...)

Here is some additional output from the logs. Any ideas about that?

Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
Oct  5 10:26:43 sosmds2 kernel: Lustre: 30617:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 30617
Oct  5 10:26:43 sosmds2 kernel: ll_mdt_47 R  running task   0 30617 1 30618 30616 (L-TLB)
Oct  5 10:26:43 sosmds2 kernel:   0001 000714a28100 0001
Oct  5 10:26:43 sosmds2 kernel:  0001 0086 0012 8102212dfe88
Oct  5 10:26:43 sosmds2 kernel:  0001 802f6aa0
Oct  5 10:26:43 sosmds2 kernel: Call Trace:
Oct  5 10:26:43 sosmds2 kernel:  [8009daf8] autoremove_wake_function+0x9/0x2e
Oct  5 10:26:43 sosmds2 kernel:  [80088819] __wake_up_common+0x3e/0x68
Oct  5 10:26:43 sosmds2 kernel:  [80088819] __wake_up_common+0x3e/0x68
Oct  5 10:26:43 sosmds2 kernel:  [8008f7ac] vprintk+0x2cb/0x317
Oct  5 10:26:43 sosmds2 kernel:  [800a540a] kallsyms_lookup+0xc2/0x17b
Oct  5 10:26:43 sosmds2 last message repeated 3 times
Oct  5 10:26:43 sosmds2 kernel:  [8006bb5d] printk_address+0x9f/0xab
Oct  5 10:26:43 sosmds2 kernel:  [8008f800] printk+0x8/0xbd
Oct  5 10:26:43 sosmds2 kernel:  [8008f84a] printk+0x52/0xbd
Oct  5 10:26:43 sosmds2 kernel:  [800a2e08] module_text_address+0x33/0x3c
Oct  5 10:26:43 sosmds2 kernel:  [8009c088] kernel_text_address+0x1a/0x26
Oct  5 10:26:43 sosmds2 kernel:  [8006b843] dump_trace+0x211/0x23a
Oct  5 10:26:43 sosmds2 kernel:  [8006b8a0] show_trace+0x34/0x47
Oct  5 10:26:43 sosmds2 kernel:  [8006b9a5] _show_stack+0xdb/0xea
Oct  5 10:26:43 sosmds2 kernel:  [885b1ada] :libcfs:lbug_with_loc+0x7a/0xd0
Oct  5 10:26:43 sosmds2 kernel:  [885b9c70] :libcfs:tracefile_init+0x0/0x110
Oct  5 10:26:43 sosmds2 kernel:  [88712218] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct  5 10:26:43 sosmds2 kernel:  [889ec529] :mds:mds_getattr_lock+0xc59/0xce0
Oct  5 10:26:43 sosmds2 kernel:  [88710ea4] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct  5 10:26:43 sosmds2 kernel:  [88602923] :lnet:lnet_ni_send+0x93/0xd0
Oct  5 10:26:43 sosmds2 kernel:  [88604d23] :lnet:lnet_send+0x973/0x9a0
Oct  5 10:26:43 sosmds2 kernel:  [8005c2dc] cache_alloc_refill+0x106/0x186
Oct  5 10:26:43 sosmds2 kernel:  [889e6fca] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct  5 10:26:43 sosmds2 kernel:  [889f2a76] :mds:mds_intent_policy+0x636/0xc10
Oct  5 10:26:43 sosmds2 kernel:  [886d36f6] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct  5 10:26:43 sosmds2 kernel:  [886d0d46] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct  5 10:26:43 sosmds2 kernel:  [886ecacf] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct  5 10:26:43 sosmds2 kernel:  [8864fe48] :obdclass:lustre_hash_add+0x218/0x2e0
Oct  5 10:26:43 sosmds2 kernel:  [886f5530] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct  5 10:26:43 sosmds2 kernel:  [886f3669] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct  5 10:26:43 sosmds2 kernel:  [889f0630] :mds:mds_handle+0x4080/0x4cb0
Oct  5 10:26:43 sosmds2 kernel:  [80148d4f] __next_cpu+0x19/0x28
Oct  5 10:26:43 sosmds2 kernel:  [80088f32] find_busiest_group+0x20d/0x621
Oct  5 10:26:43 sosmds2 kernel:  [88715a15] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  5 10:26:43 sosmds2 kernel:  [80089d89] enqueue_task+0x41/0x56
Oct  5 10:26:43 sosmds2 kernel:  [8871a72d] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  5 10:26:43 sosmds2 kernel:  [8871ce67] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct  5 10:26:43 sosmds2 kernel:  [80063098] thread_return+0x62/0xfe
Oct  5 10:26:43 sosmds2 kernel:  [80088819] __wake_up_common+0x3e/0x68
Oct  5 10:26:43 sosmds2 kernel:  [88720908] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  5 10:26:43 sosmds2 kernel:  [8008a3ef] default_wake_function+0x0/0xe
Oct  5 10:26:43 sosmds2 kernel:  [800b48dd] audit_syscall_exit+0x327/0x342
Oct  5 10:26:43 

Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Brian J. Murrell
On Tue, 2009-10-06 at 17:01 +0200, Michael Schwartzkopff wrote:
 
> Here is some additional output from the logs. Any ideas about that?
>
> Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed

Here's the failed assertion.

> Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG

Which always leads to an LBUG which is what is putting the thread to
sleep.

Any time you see an LBUG in a server log file, you need to reboot the
server.

So now you need to take that ASSERTION message to our bugzilla and see
if you can find a bug for it already, and if not, file a new one, please.

Cheers,
b.





Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Michael Schwartzkopff
On Tuesday, 6 October 2009 17:08:44, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 17:01 +0200, Michael Schwartzkopff wrote:
> > Here is some additional output from the logs. Any ideas about that?
> >
> > Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed

> Here's the failed assertion.

> > Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG

> Which always leads to an LBUG which is what is putting the thread to
> sleep.

> Any time you see an LBUG in a server log file, you need to reboot the
> server.

> So now you need to take that ASSERTION message to our bugzilla and see
> if you can find a bug for it already, and if not, file a new one, please.

> Cheers,
> b.

Thanks for your fast reply. I think bug #20020 is the one we hit.
Waiting for a solution.
Greetings,


Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Brian J. Murrell
On Tue, 2009-10-06 at 17:13 +0200, Michael Schwartzkopff wrote:
 
> Thanks for your fast reply. I think bug #20020 is the one we hit.

Certainly seems so.

> Waiting for a solution.

Patches have landed for that bug, but it appears to have been reopened.
You could CC yourself on that bug to follow progress.

b.


