[Lustre-discuss] Question about sleeping processes
Hi,

my system load shows that quite a number of processes are waiting. ps shows me the same number of processes in state D (uninterruptible sleep). All of these processes are named ll_mdt_NN, where NN is a decimal number. In the logs I find the entry shown below.

My questions are:
What causes the problem?
Can I kill the hanging processes?

System: Lustre 1.8.1 on RHEL 5.3

Thanks for any hints.

---
Oct 5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 28402: it was inactive for 200.00s
Oct 5 10:28:03 sosmds2 kernel: ll_mdt_35 D 81000100c980 0 28402 1 28403 28388 (L-TLB)
Oct 5 10:28:03 sosmds2 kernel: 81041c723810 0046 7fff
Oct 5 10:28:03 sosmds2 kernel: 81041c7237d0 0001 81022f3e60c0 81022f12e080
Oct 5 10:28:03 sosmds2 kernel: 000177b2feff847c 14df 81022f3e62a8 0001028f
Oct 5 10:28:03 sosmds2 kernel: Call Trace:
Oct 5 10:28:03 sosmds2 kernel: [8008a3ef] default_wake_function+0x0/0xe
Oct 5 10:28:03 sosmds2 kernel: [885b1b26] :libcfs:lbug_with_loc+0xc6/0xd0
Oct 5 10:28:03 sosmds2 kernel: [885b9c70] :libcfs:tracefile_init+0x0/0x110
Oct 5 10:28:03 sosmds2 kernel: [88712218] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct 5 10:28:03 sosmds2 kernel: [889ec529] :mds:mds_getattr_lock+0xc59/0xce0
Oct 5 10:28:03 sosmds2 kernel: [88710ea4] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 5 10:28:03 sosmds2 kernel: [88602923] :lnet:lnet_ni_send+0x93/0xd0
Oct 5 10:28:03 sosmds2 kernel: [88604d23] :lnet:lnet_send+0x973/0x9a0
Oct 5 10:28:03 sosmds2 kernel: [889e6fca] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct 5 10:28:03 sosmds2 kernel: [889f2a76] :mds:mds_intent_policy+0x636/0xc10
Oct 5 10:28:03 sosmds2 kernel: [886d36f6] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 5 10:28:03 sosmds2 kernel: [886d0d46] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct 5 10:28:03 sosmds2 kernel: [886ecacf] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct 5 10:28:03 sosmds2 kernel: [8864fe48] :obdclass:lustre_hash_add+0x218/0x2e0
Oct 5 10:28:03 sosmds2 kernel: [886f5530] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct 5 10:28:03 sosmds2 kernel: [886f3669] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct 5 10:28:03 sosmds2 kernel: [889f0630] :mds:mds_handle+0x4080/0x4cb0
Oct 5 10:28:03 sosmds2 kernel: [885e0047] :lvfs:lprocfs_counter_sub+0x57/0x90
Oct 5 10:28:03 sosmds2 kernel: [80148d4f] __next_cpu+0x19/0x28
Oct 5 10:28:03 sosmds2 kernel: [88715a15] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 5 10:28:03 sosmds2 kernel: [80089d89] enqueue_task+0x41/0x56
Oct 5 10:28:03 sosmds2 kernel: [8871a72d] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 5 10:28:03 sosmds2 kernel: [8871ce67] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 5 10:28:03 sosmds2 kernel: [8003dc3f] lock_timer_base+0x1b/0x3c
Oct 5 10:28:03 sosmds2 kernel: [80088819] __wake_up_common+0x3e/0x68
Oct 5 10:28:03 sosmds2 kernel: [88720908] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 5 10:28:03 sosmds2 kernel: [8008a3ef] default_wake_function+0x0/0xe
Oct 5 10:28:03 sosmds2 kernel: [800b48dd] audit_syscall_exit+0x327/0x342
Oct 5 10:28:03 sosmds2 kernel: [8005dfb1] child_rip+0xa/0x11
Oct 5 10:28:03 sosmds2 kernel: [8871f6f0] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 5 10:28:03 sosmds2 kernel: [8005dfa7] child_rip+0x0/0x11

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Address: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75
mail: mi...@multinet.de
web: www.multinet.de
Registered office: 85630 Grasbrunn
Register court: Amtsgericht München HRB 114375
Managing directors: Günter Jurgeneit, Hubert Martens
---
PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
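For readers who want to reproduce the "state D" check from the message above without ps, here is a minimal Python sketch. It assumes a Linux /proc filesystem, where each /proc/<pid>/stat file starts with the pid, the command name in parentheses, and the one-letter state. The function names parse_stat and blocked_processes are made up for this illustration and are not part of Lustre or any tool mentioned in the thread.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited, or the file was not readable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

On a node like the one described, the output would be expected to list the ll_mdt_NN threads. This only reports which threads are blocked; it says nothing about why.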
Re: [Lustre-discuss] Question about sleeping processes
On Tue, 2009-10-06 at 12:48 +0200, Michael Schwartzkopff wrote:
> Hi,

Hi,

> my system load shows that quite a number of processes are waiting.

Blocked. I guess the word "waiting" is similar.

> My questions are:
> What causes the problem?

In this case, the thread has lbugged previously. If you look in the syslog of the node with these processes, you should find entries with LBUG and/or ASSERTION messages. These are the defects that are causing the processes to get blocked (uninterruptible sleep).

> Can I kill the hanging processes?

Nope. You have to reboot the node.

Please search bugzilla for the LBUG/ASSERTIONs you are getting, and if you don't find anything that matches, please file a new bug.

> Oct 5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 28402: it was inactive for 200.00s
> Oct 5 10:28:03 sosmds2 kernel: ll_mdt_35 D 81000100c980 0 28402 1 28403 28388 (L-TLB)
> Oct 5 10:28:03 sosmds2 kernel: 81041c723810 0046 7fff
> Oct 5 10:28:03 sosmds2 kernel: 81041c7237d0 0001 81022f3e60c0 81022f12e080
> Oct 5 10:28:03 sosmds2 kernel: 000177b2feff847c 14df 81022f3e62a8 0001028f
> Oct 5 10:28:03 sosmds2 kernel: Call Trace:
> Oct 5 10:28:03 sosmds2 kernel: [8008a3ef] default_wake_function+0x0/0xe
> Oct 5 10:28:03 sosmds2 kernel: [885b1b26] :libcfs:lbug_with_loc+0xc6/0xd0

Here's where you can see that the thread has lbugged.

b.
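The check described in the reply above (look for lbug_with_loc in the stack that follows a watchdog message) can be sketched in Python as follows. This is only an illustration: the regular expressions are assumptions based on the log excerpt quoted in this thread, the name lbugged_pids is made up, and frames are attributed to the most recent watchdog message, so interleaved traces from other threads would be misattributed.

```python
import re

# Patterns inferred from the log excerpt in this thread (an assumption,
# not a documented Lustre log format). Addresses may appear as [885b1b26]
# or as [<ffffffff885b1b26>], so the angle brackets are optional.
WATCHDOG = re.compile(r"Watchdog triggered for pid (\d+)")
FRAME = re.compile(r"\[<?([0-9a-f]+)>?\]\s+(?::(\w+):)?(\w+)\+0x")


def lbugged_pids(lines):
    """Map each watchdog pid to True if its trace contains lbug_with_loc."""
    result = {}
    current = None
    for line in lines:
        hit = WATCHDOG.search(line)
        if hit:
            current = int(hit.group(1))
            result[current] = False
            continue
        frame = FRAME.search(line)
        if frame and current is not None and frame.group(3) == "lbug_with_loc":
            result[current] = True
    return result


SAMPLE = [
    "Oct 5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) "
    "Watchdog triggered for pid 28402: it was inactive for 200.00s",
    "Oct 5 10:28:03 sosmds2 kernel: Call Trace:",
    "Oct 5 10:28:03 sosmds2 kernel: [8008a3ef] default_wake_function+0x0/0xe",
    "Oct 5 10:28:03 sosmds2 kernel: [885b1b26] :libcfs:lbug_with_loc+0xc6/0xd0",
]

if __name__ == "__main__":
    print(lbugged_pids(SAMPLE))
```

A pid that maps to True is a thread that went to sleep inside the LBUG handler; per the reply above, such a thread cannot be killed and the node needs a reboot.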
Re: [Lustre-discuss] Question about sleeping processes
On Tuesday, 6 October 2009 16:22:08, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 12:48 +0200, Michael Schwartzkopff wrote:
> > Hi,
>
> Hi,
>
> > my system load shows that quite a number of processes are waiting.
>
> Blocked. I guess the word "waiting" is similar.
>
> > My questions are:
> > What causes the problem?
>
> In this case, the thread has lbugged previously. If you look in the syslog of the node with these processes, you should find entries with LBUG and/or ASSERTION messages. These are the defects that are causing the processes to get blocked (uninterruptible sleep).
(...)

Here is some additional information from the logs. Any ideas about that?

Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
Oct 5 10:26:43 sosmds2 kernel: Lustre: 30617:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 30617
Oct 5 10:26:43 sosmds2 kernel: ll_mdt_47 R running task 0 30617 1 30618 30616 (L-TLB)
Oct 5 10:26:43 sosmds2 kernel: 0001 000714a28100 0001
Oct 5 10:26:43 sosmds2 kernel: 0001 0086 0012 8102212dfe88
Oct 5 10:26:43 sosmds2 kernel: 0001 802f6aa0
Oct 5 10:26:43 sosmds2 kernel: Call Trace:
Oct 5 10:26:43 sosmds2 kernel: [8009daf8] autoremove_wake_function+0x9/0x2e
Oct 5 10:26:43 sosmds2 kernel: [80088819] __wake_up_common+0x3e/0x68
Oct 5 10:26:43 sosmds2 kernel: [80088819] __wake_up_common+0x3e/0x68
Oct 5 10:26:43 sosmds2 kernel: [8008f7ac] vprintk+0x2cb/0x317
Oct 5 10:26:43 sosmds2 kernel: [800a540a] kallsyms_lookup+0xc2/0x17b
Oct 5 10:26:43 sosmds2 last message repeated 3 times
Oct 5 10:26:43 sosmds2 kernel: [8006bb5d] printk_address+0x9f/0xab
Oct 5 10:26:43 sosmds2 kernel: [8008f800] printk+0x8/0xbd
Oct 5 10:26:43 sosmds2 kernel: [8008f84a] printk+0x52/0xbd
Oct 5 10:26:43 sosmds2 kernel: [800a2e08] module_text_address+0x33/0x3c
Oct 5 10:26:43 sosmds2 kernel: [8009c088] kernel_text_address+0x1a/0x26
Oct 5 10:26:43 sosmds2 kernel: [8006b843] dump_trace+0x211/0x23a
Oct 5 10:26:43 sosmds2 kernel: [8006b8a0] show_trace+0x34/0x47
Oct 5 10:26:43 sosmds2 kernel: [8006b9a5] _show_stack+0xdb/0xea
Oct 5 10:26:43 sosmds2 kernel: [885b1ada] :libcfs:lbug_with_loc+0x7a/0xd0
Oct 5 10:26:43 sosmds2 kernel: [885b9c70] :libcfs:tracefile_init+0x0/0x110
Oct 5 10:26:43 sosmds2 kernel: [88712218] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct 5 10:26:43 sosmds2 kernel: [889ec529] :mds:mds_getattr_lock+0xc59/0xce0
Oct 5 10:26:43 sosmds2 kernel: [88710ea4] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 5 10:26:43 sosmds2 kernel: [88602923] :lnet:lnet_ni_send+0x93/0xd0
Oct 5 10:26:43 sosmds2 kernel: [88604d23] :lnet:lnet_send+0x973/0x9a0
Oct 5 10:26:43 sosmds2 kernel: [8005c2dc] cache_alloc_refill+0x106/0x186
Oct 5 10:26:43 sosmds2 kernel: [889e6fca] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct 5 10:26:43 sosmds2 kernel: [889f2a76] :mds:mds_intent_policy+0x636/0xc10
Oct 5 10:26:43 sosmds2 kernel: [886d36f6] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 5 10:26:43 sosmds2 kernel: [886d0d46] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct 5 10:26:43 sosmds2 kernel: [886ecacf] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct 5 10:26:43 sosmds2 kernel: [8864fe48] :obdclass:lustre_hash_add+0x218/0x2e0
Oct 5 10:26:43 sosmds2 kernel: [886f5530] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct 5 10:26:43 sosmds2 kernel: [886f3669] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct 5 10:26:43 sosmds2 kernel: [889f0630] :mds:mds_handle+0x4080/0x4cb0
Oct 5 10:26:43 sosmds2 kernel: [80148d4f] __next_cpu+0x19/0x28
Oct 5 10:26:43 sosmds2 kernel: [80088f32] find_busiest_group+0x20d/0x621
Oct 5 10:26:43 sosmds2 kernel: [88715a15] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 5 10:26:43 sosmds2 kernel: [80089d89] enqueue_task+0x41/0x56
Oct 5 10:26:43 sosmds2 kernel: [8871a72d] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 5 10:26:43 sosmds2 kernel: [8871ce67] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 5 10:26:43 sosmds2 kernel: [80063098] thread_return+0x62/0xfe
Oct 5 10:26:43 sosmds2 kernel: [80088819] __wake_up_common+0x3e/0x68
Oct 5 10:26:43 sosmds2 kernel: [88720908] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 5 10:26:43 sosmds2 kernel: [8008a3ef] default_wake_function+0x0/0xe
Oct 5 10:26:43 sosmds2 kernel: [800b48dd] audit_syscall_exit+0x327/0x342
Oct 5 10:26:43
Re: [Lustre-discuss] Question about sleeping processes
On Tue, 2009-10-06 at 17:01 +0200, Michael Schwartzkopff wrote:
> Here is some additional information from the logs. Any ideas about that?
>
> Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed

Here's the failed assertion.

> Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG

Which always leads to an LBUG, which is what is putting the thread to sleep. Any time you see an LBUG in a server log file, you need to reboot the server.

So now you need to take that ASSERTION message to our bugzilla and see if you can find a bug for it already, and if not, file a new one, please.

Cheers,
b.
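As a rough aid for the "take that ASSERTION message to bugzilla" step, the following Python sketch pulls LBUG and ASSERTION entries out of syslog lines and reduces them to unique assertion strings that can be pasted into a bug search. The pattern is an assumption derived from the two LustreError lines quoted in this thread, not a documented log format, and the names find_lbugs and search_terms are hypothetical. The sample lines are adapted from the quoted excerpt.

```python
import re

# Pattern inferred from the LustreError lines quoted in this thread
# (an assumption): "<pid>:<n>:(<file>:<line>:<func>()) <message>".
LBUG_LINE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:\s*"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<what>ASSERTION\(.*\) failed|LBUG)"
)


def find_lbugs(lines):
    """Return (pid, file, line, func, what) for each LBUG/ASSERTION entry."""
    hits = []
    for text in lines:
        m = LBUG_LINE.search(text)
        if m:
            hits.append((int(m.group("pid")), m.group("file"),
                         int(m.group("line")), m.group("func"),
                         m.group("what")))
    return hits


def search_terms(hits):
    """Unique ASSERTION strings, in first-seen order, for a bug search."""
    seen = []
    for hit in hits:
        what = hit[-1]
        if what.startswith("ASSERTION") and what not in seen:
            seen.append(what)
    return seen


SAMPLE = [
    "Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:"
    "(pack_generic.c:655:lustre_shrink_reply_v2()) "
    "ASSERTION(msg->lm_bufcount > segment) failed",
    "Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:"
    "(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG",
]

if __name__ == "__main__":
    for term in search_terms(find_lbugs(SAMPLE)):
        print(term)
```

The source file, line number, and function name returned by find_lbugs are also worth including in a search or a new report, since the same assertion text can in principle appear in more than one place.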
Re: [Lustre-discuss] Question about sleeping processes
On Tuesday, 6 October 2009 17:08:44, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 17:01 +0200, Michael Schwartzkopff wrote:
> > Here is some additional information from the logs. Any ideas about that?
> >
> > Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
>
> Here's the failed assertion.
>
> > Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
>
> Which always leads to an LBUG, which is what is putting the thread to sleep. Any time you see an LBUG in a server log file, you need to reboot the server.
>
> So now you need to take that ASSERTION message to our bugzilla and see if you can find a bug for it already, and if not, file a new one, please.
>
> Cheers,
> b.

Thanks for your fast reply. I think #20020 is the one we hit. Waiting for a solution.

Greetings,

--
Dr. Michael Schwartzkopff
Re: [Lustre-discuss] Question about sleeping processes
On Tue, 2009-10-06 at 17:13 +0200, Michael Schwartzkopff wrote:
> Thanks for your fast reply. I think #20020 is the one we hit.

Certainly seems so.

> Waiting for a solution.

There are patches landed for that bug, but it would appear that it's been reopened. You could CC yourself to that bug to follow progress.

b.