Hi list! We have a very simple Lustre setup as follows:

Server1 (MGS/MDS)
    1 MGS/MDS that contains 3 LUNs for 2 Lustre filesystems:
    1 LUN = MGS data
    1 LUN = home dirs for users
    1 LUN = research data
Server2 (currently unused)
Server3 (OSS for research data - no errors)
Server4 (OSS for mds1 that contains homedir data)
    12 OSTs, approximately 1.1T each

All servers are running CentOS 5.1 with the 1.6.7.2 RPMs from Sun. We also have 5 clients running Ubuntu with kernel 2.6.22.19 (patchless).

Today client1 lost its Lustre mounts (a df -h hangs), but the other clients were all OK. On the OSS for the homedir data, I saw the following in /var/log/syslog:

Oct 5 13:07:48 maglustre04 kernel:
Oct 5 13:07:58 maglustre04 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_35:13366]
Oct 5 13:07:58 maglustre04 kernel: CPU 1:
Oct 5 13:07:58 maglustre04 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) sunrpc(U) dm_round_robin(U) dm_emc(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pata_acpi(U) lpfc(U) ide_cd(U) hpwdt(U) bnx2(U) shpchp(U) cdrom(U) scsi_transport_fc(U) i5000_edac(U) serio_raw(U) edac_mc(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 5 13:07:58 maglustre04 kernel: Pid: 13366, comm: ll_ost_35 Tainted: G 2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1
Oct 5 13:07:58 maglustre04 kernel: RIP: 0010:[<ffffffff8856caed>] [<ffffffff8856caed>] :ptlrpc:ptlrpc_queue_wait+0x93d/0x1690
Oct 5 13:07:58 maglustre04 kernel: RSP: 0018:ffff8101acd09780 EFLAGS: 00000202
Oct 5 13:07:58 maglustre04 kernel: RAX: ffff81051bf1cc00 RBX: ffff8103f1b76000 RCX: 0000000000080000
Oct 5 13:07:58 maglustre04 kernel: RDX: ffff81051bf1cca0 RSI: ffff81023e40fc08 RDI: ffff8103f1b76008
Oct 5 13:07:58 maglustre04 kernel: RBP: ffff8103f1b7605c R08: 00000000ffffffff R09: 0000000000000020
Oct 5 13:07:58 maglustre04 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8103f1b76000
Oct 5 13:07:58 maglustre04 kernel: R13: ffff8103f1b76000 R14: 0000000000000013 R15: ffffffff885657ec
Oct 5 13:07:58 maglustre04 kernel: FS: 00002b9509e77220(0000) GS:ffff81052ff9a640(0000) knlGS:0000000000000000
Oct 5 13:07:58 maglustre04 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 5 13:07:58 maglustre04 kernel: CR2: 0000003184c99a60 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 5 13:07:58 maglustre04 kernel:
Oct 5 13:07:58 maglustre04 kernel: Call Trace:
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885768b5>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88566e73>] :ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8008abbc>] default_wake_function+0x0/0xe
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88554a87>] :ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88561953>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88549700>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8881818c>] :obdfilter:filter_intent_policy+0x68c/0x7a0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88536d76>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb00
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885518ef>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff884ba688>] :obdclass:lustre_hash_add+0x208/0x2d0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8855a490>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885585e9>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885751b8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff887d630a>] :ost:ost_handle+0x565a/0x5cd0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff80143b75>] __next_cpu+0x19/0x28
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff80143b75>] __next_cpu+0x19/0x28
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff800898e6>] find_busiest_group+0x20d/0x621
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88574795>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8857ceea>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8857ed6d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885812f3>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff884247c6>] :libcfs:lcw_update_time+0x16/0x100
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff800891f9>] __wake_up_common+0x3e/0x68
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885847e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8008abbc>] default_wake_function+0x0/0xe
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff800b4391>] audit_syscall_exit+0x31b/0x336
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885835d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 5 13:07:58 maglustre04 kernel:
Oct 5 13:07:58 maglustre04 kernel: BUG: soft lockup - CPU#4 stuck for 10s! [ll_ost_90:13421]
Oct 5 13:07:58 maglustre04 kernel: CPU 4:
Oct 5 13:07:58 maglustre04 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) sunrpc(U) dm_round_robin(U) dm_emc(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pata_acpi(U) lpfc(U) ide_cd(U) hpwdt(U) bnx2(U) shpchp(U) cdrom(U) scsi_transport_fc(U) i5000_edac(U) serio_raw(U) edac_mc(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Oct 5 13:07:58 maglustre04 kernel: Pid: 13421, comm: ll_ost_90 Tainted: G 2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1
Oct 5 13:07:58 maglustre04 kernel: RIP: 0010:[<ffffffff8005475f>] [<ffffffff8005475f>] strrchr+0x19/0x24
Oct 5 13:07:58 maglustre04 kernel: RSP: 0018:ffff8101e534f358 EFLAGS: 00000212
Oct 5 13:07:58 maglustre04 kernel: RAX: ffffffff885a6497 RBX: ffffffff885ae804 RCX: 0000000000000039
Oct 5 13:07:58 maglustre04 kernel: RDX: ffffffff885a6460 RSI: 000000000000002f RDI: ffffffff885a6499
Oct 5 13:07:58 maglustre04 kernel: RBP: 0000010000000100 R08: ffffffff8859ebe0 R09: 00000000000007b7
Oct 5 13:07:58 maglustre04 kernel: R10: ffffffff885ae831 R11: ffffffff885ae804 R12: ffffffff00000107
Oct 5 13:07:58 maglustre04 kernel: R13: ffff8101b0531b58 R14: 00000000000000b4 R15: 000000a800000100
Oct 5 13:07:58 maglustre04 kernel: FS: 00002b9509e77220(0000) GS:ffff81052fe21b40(0000) knlGS:0000000000000000
Oct 5 13:07:58 maglustre04 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 5 13:07:58 maglustre04 kernel: CR2: 0000003184c6bf00 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 5 13:07:58 maglustre04 kernel:
Oct 5 13:07:58 maglustre04 kernel: Call Trace:
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff884232aa>] :libcfs:libcfs_debug_vmsg2+0x4a/0x980
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88576005>] :ptlrpc:_debug_req+0x4b5/0x4d0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8002e1d8>] __wake_up+0x38/0x4f
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88576005>] :ptlrpc:_debug_req+0x4b5/0x4d0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88566242>] :ptlrpc:ptlrpc_expire_one_request+0x1d2/0x530
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885657ec>] :ptlrpc:ptlrpc_unregister_reply+0x13c/0x9c0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8856929d>] :ptlrpc:ptlrpc_check_reply+0x18d/0x530
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8856caa0>] :ptlrpc:ptlrpc_queue_wait+0x8f0/0x1690
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885768b5>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88566e73>] :ptlrpc:ptlrpc_prep_req_pool+0x613/0x6b0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8008abbc>] default_wake_function+0x0/0xe
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88554a87>] :ptlrpc:ldlm_server_glimpse_ast+0x257/0x3a0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88561953>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88549700>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8881818c>] :obdfilter:filter_intent_policy+0x68c/0x7a0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88536d76>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb00
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885518ef>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff884ba688>] :obdclass:lustre_hash_add+0x208/0x2d0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8855a490>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885585e9>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885751b8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff887d630a>] :ost:ost_handle+0x565a/0x5cd0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff80143b75>] __next_cpu+0x19/0x28
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff800898e6>] find_busiest_group+0x20d/0x621
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff88574795>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8857ceea>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8857ed6d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885812f3>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff884247c6>] :libcfs:lcw_update_time+0x16/0x100
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff800891f9>] __wake_up_common+0x3e/0x68
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885847e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8008abbc>] default_wake_function+0x0/0xe
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff800b4391>] audit_syscall_exit+0x31b/0x336
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff885835d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 5 13:07:58 maglustre04 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct 5 13:07:58 maglustre04 kernel:

After searching Bugzilla, it appears this may be bug #19785. Do you guys agree? The difference is that the "RIP" line there contains a reference to text.lock.spinlock, whereas ours contains strrchr for one thread and ptlrpc_queue_wait for the other.

In the meantime, server4 (maglustre04) has two hung threads (100% CPU) which appear to be OST/IO related. What is the correct way to resolve this?

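
In case it is useful for comparing our reports against the one in Bugzilla, here is a minimal sketch of a script that pulls the thread name and the RIP symbol out of each soft lockup report. The function name summarize_lockups and both regular expressions are my own assumptions based on the log format above; this is not a Lustre tool, and the patterns may need adjusting for other kernels.

```python
import re

# Both patterns are assumptions derived from the syslog excerpt in this message.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!\s+\[([^:\]\s]+):(\d+)\]"
)
# Matches e.g. "RIP: 0010:[<addr>] [<addr>] :ptlrpc:ptlrpc_queue_wait+0x93d/0x1690"
# as well as plain kernel symbols such as "strrchr+0x19/0x24".
RIP_RE = re.compile(
    r"RIP:\s+\S+\s+\[<[0-9a-f]+>\]\s+(?::(\w+):)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockups(text):
    """Return one dict per soft lockup report found in text.

    Each lockup is paired with the first RIP line that follows it and
    precedes the next lockup report (if any).
    """
    lockups = list(LOCKUP_RE.finditer(text))
    results = []
    for i, m in enumerate(lockups):
        # Limit the RIP search to the span belonging to this report.
        end = lockups[i + 1].start() if i + 1 < len(lockups) else len(text)
        rip = RIP_RE.search(text, m.end(), end)
        results.append({
            "cpu": int(m.group(1)),
            "seconds": int(m.group(2)),
            "thread": m.group(3),
            "pid": int(m.group(4)),
            # module is None when the RIP is in the core kernel, not a module
            "module": rip.group(1) if rip else None,
            "symbol": rip.group(2) if rip else None,
        })
    return results
```

Calling summarize_lockups(open("/var/log/syslog").read()) on the OSS should give one entry per hung thread, which makes it easier to see at a glance whether the RIP symbols match the ones in a given bug report.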
Thank you,
Robert
