Dear All,

Our system was recently upgraded to lustre-2.10.6. We are migrating data from some almost-full OSTs to a newly installed file server, but we often see the file system freeze for about 30 seconds and then return to normal (this can happen several times within 5 minutes).
Our procedure is as follows.

1. On the MDS, we prevented new objects from being created on the OSTs that are almost full:

echo 0 > /proc/fs/lustre/osc/chome-OST0000-osc-MDT0000/max_create_count
echo 0 > /proc/fs/lustre/osc/chome-OST0001-osc-MDT0000/max_create_count
echo 0 > /proc/fs/lustre/osc/chome-OST0002-osc-MDT0000/max_create_count
....

2. Our system has 40 OSTs (total size 286 TB), of which 36 are almost full, so all of them were marked by the command above. We are moving part of their data to the remaining 4 new OSTs in the following standard way:

cp -a /path/to/data /path/to/data.tmp
mv /path/to/data.tmp /path/to/data

In the beginning everything ran smoothly, but after one week the progress became slower and slower. We then found that the file system often freezes for a while when the data migration is running, even though there is almost no load on the whole system. It is also strange that during the past week we did not see any 'dmesg' messages on the MDT, the OSTs, or the client. Only last night did the MDT print these 'dmesg' messages:

==============================================================================
[410649.811086] LNet: Service thread pid 3516 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[410649.811175] Pid: 3516, comm: mdt00_003
[410649.811201]
[410649.811202] Call Trace:
[410649.811250] [<ffffffff81033625>] ? check_preempt_curr+0x75/0xa0
[410649.811278] [<ffffffff8103366b>] ? ttwu_do_wakeup+0x1b/0xa0
[410649.811306] [<ffffffff810376c1>] ? ttwu_do_activate.constprop.160+0x61/0x70
[410649.811336] [<ffffffff8103c40a>] ? try_to_wake_up+0x1da/0x280
[410649.811367] [<ffffffff814106ba>] schedule+0x3a/0x50
[410649.811393] [<ffffffff81410a95>] schedule_timeout+0x145/0x210
[410649.811421] [<ffffffff8104d090>] ? process_timeout+0x0/0x10
[410649.811451] [<ffffffffa0b6dc88>] osp_precreate_reserve+0x328/0x8b0 [osp]
[410649.811484] [<ffffffffa014f026>] ? do_get_write_access+0x396/0x4d0 [jbd2]
[410649.811515] [<ffffffff81112a10>] ? __getblk+0x20/0x2e0
[410649.811542] [<ffffffff8103c4b0>] ? default_wake_function+0x0/0x10
[410649.811571] [<ffffffffa0b64759>] osp_declare_create+0x1a9/0x680 [osp]
[410649.811603] [<ffffffffa0ab2a10>] lod_sub_declare_create+0xe0/0x270 [lod]
[410649.811633] [<ffffffffa0aabdc7>] lod_qos_declare_object_on+0xc7/0x3d0 [lod]
[410649.811664] [<ffffffffa0aab7fe>] ? lod_statfs_and_check+0xae/0x5b0 [lod]
[410649.811694] [<ffffffffa0aacfd4>] lod_alloc_qos.constprop.10+0xe64/0x17b0 [lod]
[410649.811741] [<ffffffffa01741b0>] ? ldiskfs_map_blocks+0x180/0x1e0 [ldiskfs]
[410649.811772] [<ffffffffa0ab07ea>] lod_qos_prep_create+0x12ea/0x2910 [lod]
[410649.811803] [<ffffffffa07c8c94>] ? qsd_op_begin+0x114/0x4d0 [lquota]
[410649.811833] [<ffffffffa0ab23f0>] lod_prepare_create+0x2c0/0x410 [lod]
[410649.811863] [<ffffffffa0aa7ccd>] lod_declare_striped_create+0x10d/0xa50 [lod]
[410649.811908] [<ffffffffa0aaa9b9>] lod_declare_create+0x1e9/0x5a0 [lod]
[410649.811938] [<ffffffffa0b1af26>] mdd_declare_create_object_internal+0x116/0x320 [mdd]
[410649.811983] [<ffffffffa0b00c9c>] mdd_declare_create_object.isra.19+0x3c/0xbb0 [mdd]
[410649.812028] [<ffffffffa0b00044>] ? mdd_linkea_prepare+0x294/0x590 [mdd]
[410649.812058] [<ffffffffa0b0f90e>] mdd_create+0x88e/0x27d0 [mdd]
[410649.812088] [<ffffffffa08173e0>] ? osd_xattr_get+0x80/0x890 [osd_ldiskfs]
[410649.812120] [<ffffffffa09fd3ff>] mdt_reint_open+0x225f/0x3890 [mdt]
[410649.812158] [<ffffffffa0431276>] ? null_alloc_rs+0x186/0x340 [ptlrpc]
[410649.812191] [<ffffffffa02ea3fa>] ? upcall_cache_get_entry+0x29a/0x890 [obdclass]
[410649.812237] [<ffffffffa02ef409>] ? lu_ucred+0x19/0x30 [obdclass]
[410649.812267] [<ffffffffa09df7db>] ? ucred_set_jobid+0x5b/0x70 [mdt]
[410649.812297] [<ffffffffa09f1810>] mdt_reint_rec+0xa0/0x210 [mdt]
[410649.812326] [<ffffffffa09de91d>] mdt_reint_internal+0x63d/0xa50 [mdt]
[410649.812356] [<ffffffffa09df07a>] mdt_intent_reint+0x21a/0x430 [mdt]
[410649.812385] [<ffffffffa09da7ed>] mdt_intent_policy+0x5bd/0xde0 [mdt]
[410649.812418] [<ffffffffa03aa257>] ldlm_lock_enqueue+0x3a7/0x9c0 [ptlrpc]
[410649.812453] [<ffffffffa03d28e3>] ldlm_handle_enqueue0+0x9c3/0x1790 [ptlrpc]
[410649.812490] [<ffffffffa04211c0>] ? req_capsule_client_get+0x10/0x20 [ptlrpc]
[410649.812541] [<ffffffffa045d16c>] ? tgt_request_preprocess.isra.17+0x25c/0x1250 [ptlrpc]
[410649.812593] [<ffffffffa044ee95>] ? tgt_lookup_reply+0x35/0x1c0 [ptlrpc]
[410649.812629] [<ffffffffa045b74d>] tgt_enqueue+0x5d/0x250 [ptlrpc]
[410649.812664] [<ffffffffa045ed1d>] tgt_request_handle+0x8ad/0x15a0 [ptlrpc]
[410649.812701] [<ffffffffa03f84a4>] ? lustre_msg_get_transno+0x84/0x100 [ptlrpc]
[410649.812752] [<ffffffffa04084e1>] ptlrpc_main+0x1051/0x2a40 [ptlrpc]
[410649.812780] [<ffffffff8140ff44>] ? __schedule+0x294/0x940
[410649.812815] [<ffffffffa0407490>] ? ptlrpc_main+0x0/0x2a40 [ptlrpc]
[410649.812844] [<ffffffff8105c817>] kthread+0x87/0x90
[410649.812870] [<ffffffff81413c34>] kernel_thread_helper+0x4/0x10
[410649.812898] [<ffffffff8105c790>] ? kthread+0x0/0x90
[410649.812923] [<ffffffff81413c30>] ? kernel_thread_helper+0x0/0x10
[410649.812950]
[410649.812970] LustreError: dumping log to /tmp/lustre-log.1553888141.3516
[410649.818730] wanted to write 3985 but wrote 3518
[410749.275270] LNet: Service thread pid 3516 completed after 299.99s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
==============================================================================

Our lustre-2.10.6 was compiled against the SLES11 SP3 kernel linux-3.0.101-138.gcdbe806, using the ldiskfs backend.
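For reference, the two steps above can be sketched as one script. This is only a dry run that prints the commands it would execute (remove the leading "echo" of each printed line to run them for real); the assumption that the 36 full OSTs have indices 0 through 35 is mine for illustration, since the post only shows the first three.

```shell
#!/bin/sh
# Dry-run sketch of the migration procedure described in the post.
# Assumption: the 36 nearly-full OSTs are chome-OST0000..chome-OST0023
# (hex indices 0-35); adjust the range for your actual configuration.

# Step 1 (on the MDS): print the commands that disable object creation
# on each full OST by zeroing its max_create_count.
print_disable_cmds() {
    i=0
    while [ "$i" -lt 36 ]; do
        printf 'echo 0 > /proc/fs/lustre/osc/chome-OST%04x-osc-MDT0000/max_create_count\n' "$i"
        i=$((i + 1))
    done
}

print_disable_cmds

# Step 2 (on a client): re-copy a file so its objects are allocated on
# the new OSTs, then rename the copy over the original.
echo 'cp -a /path/to/data /path/to/data.tmp'
echo 'mv /path/to/data.tmp /path/to/data'
```

Note that a plain cp keeps the file's data readable throughout, but the mv replaces the original inode, which matters if the file is open or hard-linked during the migration.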
The hardware spec of our MDS is:

CPU: Intel Xeon E5640 @ 2.67GHz (single CPU)
RAM: 8 GB
MGS: 1 GB (under RAID 1)
MDT: 230 GB (under RAID 1)
RAID controller: LSI ServeRAID M1015 SAS/SATA Controller

Do you have any suggestions for fixing this problem? Thank you very much.

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org