Hi all,

I'm seeing reproducible OSS crashes with 1.6.5, and also with 1.6.4.3/1.6.4.2. The cluster runs 64-bit Debian Etch with kernel 2.6.22. Each OSS is a file server with two OSTs. I'm now testing with just one OSS in the system (I first encountered the problem with 9 OSS), mounting Lustre on 4 clients and writing to it with the stress utility:

    stress -d 2 --hdd-noclean --hdd-bytes 5M
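In case the exact procedure matters, this is roughly how each of the 4 clients is driven; the MGS nid and the mount point below are only placeholders (the fsname "gsilust" is the real one):

    # on each client; <mgs-nid> and /mnt/gsilust are placeholders
    mount -t lustre <mgs-nid>@tcp0:/gsilust /mnt/gsilust
    cd /mnt/gsilust
    stress -d 2 --hdd-noclean --hdd-bytes 5M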
Once the OSTs fill up to more than 60%, the machine simply stops working. There are no traces in any of the logs that relate directly to the moment of failure. I have now repeated the procedure with 9 of these machines, all of them SuperMicro X7DB8 16-slot file servers with 2 Intel Xeon E5320 quad-cores and 8 GB RAM, and with one older SuperMicro X7DB8 with 2 dual-core Xeons and 4 GB RAM on a Lustre 1.6.4.2 test system. All of these machines have two 3ware 9650 RAID controllers with 500 GB WD disks in RAID 5.

Subsequently I reformatted the OST with ext3 and ran the stress test locally on the machine (roughly the commands sketched below): no failure, the partition filled to 100% without problem. All of this seems to indicate that it is not a (sole) hardware problem.
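For completeness, the local control test was approximately the following; /dev/sdb1 and /mnt/test are just example names for one of the 3ware RAID units and a scratch mount point:

    # control run directly on the OSS, bypassing Lustre
    mkfs.ext3 /dev/sdb1               # example device name for one RAID unit
    mount -t ext3 /dev/sdb1 /mnt/test
    cd /mnt/test
    stress -d 2 --hdd-noclean --hdd-bytes 5M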
Prior to the most recent crash, the following is found in /var/log/kern.log:

    Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
    Jul 22 21:24:10 kernel: Lustre: 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000: slow journal start 37s
    Jul 22 21:24:10 kernel: Lustre: 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000: slow brw_start 37s
    Jul 22 21:24:10 kernel: Lustre: 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 37s
    Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
    Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous similar messages
    Jul 22 21:47:06 kernel: Lustre: 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 30s
    Jul 22 21:47:10 kernel: Lustre: 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
    Jul 22 21:47:15 kernel: Lustre: 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 30s
    Jul 22 21:47:15 kernel: Lustre: 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 30s
    Jul 22 21:47:54 kernel: Lustre: 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 36s
    Jul 22 21:48:30 kernel: Lustre: 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 33s
    Jul 22 21:48:30 kernel: Lustre: 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 33s
    Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s

However, these messages might as well have appeared while the machine was still working. Some of the OSS I managed to crash with a trace in kern.log; I think this is a known bug in the ext3/ext4 code:

    Jul 14 21:41:19 kernel: uh! busy PA
    Jul 14 21:41:19 kernel:
    Jul 14 21:41:19 kernel: Call Trace:
    Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
    Jul 14 21:41:19 kernel: [<ffffffff8857c75a>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x3a/0x70
    Jul 14 21:41:19 kernel: [<ffffffff8857cc0d>] :ldiskfs:ldiskfs_mb_new_blocks+0x24d/0x270
    Jul 14 21:41:19 kernel: [<ffffffff802a1025>] __find_get_block_slow+0x2f/0xf1
    Jul 14 21:41:19 kernel: [<ffffffff885c90ef>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x4ef/0x640
    Jul 14 21:41:19 kernel: [<ffffffff802a1428>] __getblk+0x1d/0x20d
    Jul 14 21:41:19 kernel: [<ffffffff88577801>] :ldiskfs:ldiskfs_ext_walk_space+0x131/0x250
    Jul 14 21:41:19 kernel: [<ffffffff885c8c00>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
    Jul 14 21:41:19 kernel: [<ffffffff885c48c4>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xa4/0x150
    Jul 14 21:41:19 kernel: [<ffffffff8845095c>] :ksocklnd:ksocknal_alloc_tx+0x2c/0x2a0
    Jul 14 21:41:19 kernel: [<ffffffff885f24db>] :obdfilter:filter_direct_io+0x12b/0xd60
    Jul 14 21:41:19 kernel: [<ffffffff885f3e9d>] :obdfilter:filter_commitrw_write+0x7bd/0x2640
    Jul 14 21:41:19 kernel: [<ffffffff883bdbbe>] :ptlrpc:ldlm_resource_foreach+0x6e/0x3a0
    Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
    Jul 14 21:41:19 kernel: [<ffffffff885aec4e>] :ost:ost_brw_write+0x15be/0x2990
    Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
    Jul 14 21:41:19 kernel: [<ffffffff885b32f5>] :ost:ost_handle+0x2745/0x5ed0
    Jul 14 21:41:19 kernel: [<ffffffff8027db5c>] cache_alloc_refill+0x94/0x1e8
    Jul 14 21:41:19 kernel: [<ffffffff8022a52c>] find_busiest_group+0x255/0x6cf
    Jul 14 21:41:19 kernel: [<ffffffff80246a60>] do_gettimeofday+0x2e/0x9e
    Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
    Jul 14 21:41:19 kernel: [<ffffffff8834d088>] :obdclass:class_handle2object+0x88/0x180
    Jul 14 21:41:19 kernel: [<ffffffff883e41d0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
    Jul 14 21:41:19 kernel: [<ffffffff883e1c9e>] :ptlrpc:lustre_swab_buf+0xbe/0xf0
    Jul 14 21:41:19 kernel: [<ffffffff8023a1ef>] __mod_timer+0xb6/0xc4
    Jul 14 21:41:19 kernel: [<ffffffff883ec38f>] :ptlrpc:ptlrpc_main+0x130f/0x1ce0
    Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
    Jul 14 21:41:19 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
    Jul 14 21:41:19 kernel: [<ffffffff883eb080>] :ptlrpc:ptlrpc_main+0x0/0x1ce0
    Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

Neither the "slow ..." messages nor the "uh! busy PA" trace shows up in every case of a crashed OSS, so I have no idea whether this is related at all. In any case, any hints would be appreciated,

Thomas
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss