Hi all,
I recently built a new small cluster for backup purposes; it has only 2 client nodes. It is based on CentOS 5.1 and Lustre 1.6.6, and the metadata is on DRBD. The frozen OST node is the secondary node for DRBD, but I'm quite sure that is not the problem; this same scenario worked for months with Lustre 1.6.4.3. While I was making the initial backup (somewhat heavier, continuous traffic), one of the OST nodes froze up. Actually not the whole node, but the two OST mounts:

/dev/sdc1 on /mnt/ost-0031 type lustre (rw)
/dev/sdd1 on /mnt/ost-0032 type lustre (rw)

The mounts cannot be unmounted, even with 'umount -f'. In the logs I see:

Jan 3 11:15:36 backup3 kernel: LustreError: 3271:0:(socklnd.c:1631:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.0...@tcp, ip 192.168.0.24:1021, with error
Jan 3 11:15:36 backup3 kernel: LustreError: 3271:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810026250000
Jan 3 11:15:36 backup3 kernel: LustreError: 11080:0:(ost_handler.c:1153:ost_brw_write()) @@@ network error on bulk GET 0(1048576) r...@ffff81007ad24450 x22232997/t0 o4->32625da1-9d50-a532-53d9-cdd307d13...@net_0x20000c0a80018_uuid:0/0 lens 384/352 e 0 to 0 dl 1230977776 ref 1 fl Interpret:/0/0 rc 0/0
Jan 3 11:15:36 backup3 kernel: Lustre: 11080:0:(ost_handler.c:1284:ost_brw_write()) backup-OST0031: ignoring bulk IO comm error with 32625da1-9d50-a532-53d9-cdd307d13...@net_0x20000c0a80018_uuid id 12345-192.168.0...@tcp - client will retry
Jan 3 11:15:41 backup3 kernel: Lustre: 11034:0:(ldlm_lib.c:538:target_handle_reconnect()) backup-OST0031: 32625da1-9d50-a532-53d9-cdd307d13ed4 reconnecting
Jan 3 11:15:41 backup3 kernel: Lustre: 11034:0:(ldlm_lib.c:773:target_handle_connect()) backup-OST0031: refuse reconnection from [email protected]@tcp to 0xffff810041e7c000; still busy with 2 active RPCs
Jan 3 11:15:41 backup3 kernel: LustreError: 11034:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16) r...@ffff810040ece600 x22233043/t0 o8->32625da1-9d50-a532-53d9-cdd307d13...@net_0x20000c0a80018_uuid:0/0 lens 304/200 e 0 to 0 dl 1230977841 ref 1 fl Interpret:/0/0 rc -16/0
Jan 3 11:16:06 backup3 kernel: Lustre: 11038:0:(ldlm_lib.c:538:target_handle_reconnect()) backup-OST0031: 32625da1-9d50-a532-53d9-cdd307d13ed4 reconnecting
Jan 3 11:16:06 backup3 kernel: Lustre: 11038:0:(ldlm_lib.c:773:target_handle_connect()) backup-OST0031: refuse reconnection from [email protected]@tcp to 0xffff810041e7c000; still busy with 2 active RPCs
Jan 3 11:16:06 backup3 kernel: LustreError: 11038:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16) r...@ffff81001aed8e00 x22233117/t0 o8->32625da1-9d50-a532-53d9-cdd307d13...@net_0x20000c0a80018_uuid:0/0 lens 304/200 e 0 to 0 dl 1230977866 ref 1 fl Interpret:/0/0 rc -16/0
Jan 3 11:16:16 backup3 kernel: LustreError: 11083:0:(ost_handler.c:1142:ost_brw_write()) @@@ timeout on bulk GET after 100+0s r...@ffff810041edd800 x22232986/t0 o4->32625da1-9d50-a532-53d9-cdd307d13...@net_0x20000c0a80018_uuid:0/0 lens 384/352 e 0 to 0 dl 1230977776 ref 1 fl Interpret:/0/0 rc 0/0
Jan 3 11:16:16 backup3 kernel: Lustre: 11083:0:(ost_handler.c:1284:ost_brw_write()) backup-OST0031: ignoring bulk IO comm error with 32625da1-9d50-a532-53d9-cdd307d13...@net_0x20000c0a80018_uuid id 12345-192.168.0...@tcp - client will retry
Jan 3 11:16:40 backup3 kernel: Lustre: 11037:0:(ldlm_lib.c:538:target_handle_reconnect()) backup-OST0031: 32625da1-9d50-a532-53d9-cdd307d13ed4 reconnecting
Jan 3 11:16:40 backup3 kernel: Lustre: 11037:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message

And after that, many of these:

Jan 3 23:11:08 backup3 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_io_08:11074]
Jan 3 23:11:08 backup3 kernel: CPU 0:
Jan 3 23:11:08 backup3 kernel: Modules linked in: obdfilter(FU) fsfilt_ldiskfs(FU) ost(FU) mgc(FU) ldiskfs(FU) crc16(U) lustre(FU) lov(FU) mdc(FU) lquota(FU) osc(FU) ksocklnd(FU) ptlrpc(FU) obdclass(FU) lnet(FU) lvfs(FU) libcfs(FU) drbd(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) sunrpc(U) dm_mirror(U) dm_multipath(U) dm_mod(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) snd_intel8x0(U) snd_ac97_codec(U) ac97_bus(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) k8temp(U) hwmon(U) shpchp(U) snd_pcm(U) k8_edac(U) snd_timer(U) edac_mc(U) snd(U) i2c_nforce2(U) soundcore(U) snd_page_alloc(U) sg(U) forcedeth(U) floppy(U) i2c_core(U) e1000e(U) pcspkr(U) parport_pc(U) parport(U) aacraid(U) pata_acpi(U) sata_nv(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Jan 3 23:11:08 backup3 kernel: Pid: 11074, comm: ll_ost_io_08 Tainted: GF 2.6.18-92.1.10.el5_lustre.1.6.6custom #2
Jan 3 23:11:08 backup3 kernel: RIP: 0010:[<ffffffff80064a36>] [<ffffffff80064a36>] _spin_lock+0x9/0xa
Jan 3 23:11:08 backup3 kernel: RSP: 0018:ffff81007db91308 EFLAGS: 00000247
Jan 3 23:11:08 backup3 kernel: RAX: ffff810015153758 RBX: ffff81003fd15258 RCX: 0000000000000007
Jan 3 23:11:08 backup3 kernel: RDX: ffff81003fd15268 RSI: ffff81001bb38000 RDI: ffff810015153778
Jan 3 23:11:08 backup3 kernel: RBP: 0000000000000000 R08: ffff810001578980 R09: 0000000000001000
Jan 3 23:11:08 backup3 kernel: R10: ffff81001bb38000 R11: 0000000000000000 R12: ffff81001fa33ba0
Jan 3 23:11:08 backup3 kernel: R13: 0000000000000000 R14: ffff81002a1164f8 R15: ffffffff88033cfe
Jan 3 23:11:08 backup3 kernel: FS: 00002adcf73f8230(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
Jan 3 23:11:08 backup3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 3 23:11:08 backup3 kernel: CR2: 00002b5de33a4018 CR3: 0000000000201000 CR4: 00000000000006e0
Jan 3 23:11:08 backup3 kernel:
Jan 3 23:11:08 backup3 kernel: Call Trace:
Jan 3 23:11:08 backup3 kernel: [<ffffffff8891f709>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x199/0x3c0
Jan 3 23:11:08 backup3 kernel: [<ffffffff8891f962>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x32/0x70
Jan 3 23:11:08 backup3 kernel: [<ffffffff8891fd9b>] :ldiskfs:ldiskfs_mb_new_blocks+0x1cb/0x280
Jan 3 23:11:08 backup3 kernel: [<ffffffff8897bc11>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x301/0x620
Jan 3 23:11:08 backup3 kernel: [<ffffffff800194f6>] __getblk+0x25/0x235
Jan 3 23:11:08 backup3 kernel: [<ffffffff88918857>] :ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250
Jan 3 23:11:08 backup3 kernel: [<ffffffff8897b910>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x620
Jan 3 23:11:08 backup3 kernel: [<ffffffff88977d1e>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150
Jan 3 23:11:08 backup3 kernel: [<ffffffff889a9cf8>] :obdfilter:filter_direct_io+0x478/0xcd0
Jan 3 23:11:08 backup3 kernel: [<ffffffff889ac690>] :obdfilter:filter_commitrw_write+0x1450/0x23d0
Jan 3 23:11:08 backup3 kernel: [<ffffffff88955621>] :ost:ost_checksum_bulk+0x351/0x590
Jan 3 23:11:08 backup3 kernel: [<ffffffff889553ac>] :ost:ost_checksum_bulk+0xdc/0x590
Jan 3 23:11:08 backup3 kernel: [<ffffffff8895bda4>] :ost:ost_brw_write+0x2004/0x2760
Jan 3 23:11:08 backup3 kernel: [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Jan 3 23:11:08 backup3 kernel: [<ffffffff88712148>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jan 3 23:11:08 backup3 kernel: [<ffffffff8895f3ee>] :ost:ost_handle+0x2eee/0x5bf0
Jan 3 23:11:08 backup3 kernel: [<ffffffff801437a4>] __next_cpu+0x19/0x28
Jan 3 23:11:08 backup3 kernel: [<ffffffff80089aa8>] find_busiest_group+0x20d/0x621
Jan 3 23:11:08 backup3 kernel: [<ffffffff88673031>] :obdclass:class_handle2object+0xd1/0x160
Jan 3 23:11:08 backup3 kernel: [<ffffffff88711705>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Jan 3 23:11:08 backup3 kernel: [<ffffffff8871b0da>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
Jan 3 23:11:08 backup3 kernel: [<ffffffff8871d2c2>] :ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
Jan 3 23:11:08 backup3 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Jan 3 23:11:08 backup3 kernel: [<ffffffff8006d940>] do_gettimeofday+0x50/0x92
Jan 3 23:11:08 backup3 kernel: [<ffffffff885d1476>] :libcfs:lcw_update_time+0x16/0x100
Jan 3 23:11:08 backup3 kernel: [<ffffffff800893bb>] __wake_up_common+0x3e/0x68
Jan 3 23:11:08 backup3 kernel: [<ffffffff8872022c>] :ptlrpc:ptlrpc_main+0xe0c/0xf90
Jan 3 23:11:08 backup3 kernel: [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Jan 3 23:11:08 backup3 kernel: [<ffffffff800b4610>] audit_syscall_exit+0x31b/0x336
Jan 3 23:11:08 backup3 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan 3 23:11:08 backup3 kernel: [<ffffffff8871f420>] :ptlrpc:ptlrpc_main+0x0/0xf90
Jan 3 23:11:08 backup3 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

$ uname -a
Linux backup3 2.6.18-92.1.10.el5_lustre.1.6.6custom #2 SMP Fri Dec 26 20:27:40 CET 2008 x86_64 x86_64 x86_64 GNU/Linux

DRBD is version 8.3.0, but I think it's OK. Is this a known bug, or something else?

Thank you,
tamas

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
