Hi Tung-Han,

Your stack trace looks similar to one we saw just yesterday on our 2.10.6
system. I've opened https://jira.whamcloud.com/browse/LU-12136 to track the
issue.

Best,
Stephane

> On Mar 29, 2019, at 8:47 PM, Tung-Han Hsieh <thhs...@twcp1.phys.ntu.edu.tw> wrote:
>
> Dear All,
>
> Our system was recently upgraded to lustre-2.10.6. We are migrating data
> from some almost-full OSTs to a newly installed file server, but the file
> system often freezes for about 30 seconds and then returns to normal
> (this can happen several times within 5 minutes).
>
> Our procedure is as follows.
>
> 1. On the MDS, we prevented data from being written to the OSTs that are
>    almost full:
>
>      echo 0 > /proc/fs/lustre/osc/chome-OST0000-osc-MDT0000/max_create_count
>      echo 0 > /proc/fs/lustre/osc/chome-OST0001-osc-MDT0000/max_create_count
>      echo 0 > /proc/fs/lustre/osc/chome-OST0002-osc-MDT0000/max_create_count
>      ....
> 2. Our system has 40 OSTs, of which 36 are almost full, so all of those
>    are marked by the command above. Our total OST size is 286 TB. We are
>    moving part of their data to the remaining 4 new OSTs in the following
>    standard way:
>
>      cp -a /path/to/data /path/to/data.tmp
>      mv /path/to/data.tmp /path/to/data
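
Two quick notes on the procedure above (sketches only, both untested, so
please check the parameter and tool names against your 2.10.6 build).

First, the max_create_count tunables can also be set with lctl instead of
writing /proc directly. A minimal sketch, assuming the full OSTs are
indices 0-35 and the device names follow the chome-OST00xx-osc-MDT0000
pattern quoted above (on some releases the parameter lives under "osp."
rather than "osc.", so check "lctl list_param *.*.max_create_count" first):

    # Untested sketch: disable new object creation on OST0000-OST0023
    # (decimal indices 0-35) from the MDS.
    for i in $(seq 0 35); do
        dev=$(printf 'chome-OST%04x-osc-MDT0000' "$i")
        lctl set_param osc.${dev}.max_create_count=0
    done

Second, instead of the cp/mv sequence (which can race with applications
that still have the files open), Lustre also ships "lfs migrate" and the
lfs_migrate wrapper, which restripe files in place. A sketch, assuming the
4 new OSTs are indices 36-39:

    # Untested sketch: move one file onto the new OSTs in place;
    # -i 36 starts allocation at OST index 36, -c 1 keeps one stripe.
    lfs migrate -c 1 -i 36 /path/to/data

    # Or drain a full OST by feeding lfs_migrate a file list:
    lfs find /path -obd chome-OST0000_UUID -type f | lfs_migrate -y
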
> In the beginning everything went smoothly, but after one week of running
> the progress became slower and slower. We then found that the file system
> often froze for a while when the data migration was running, even though
> there was almost no load on the whole system.
>
> It is also strange that during the past week we did not see any 'dmesg'
> messages on the MDT, the OSTs, or the client. Only last night did the MDT
> print these 'dmesg' messages:
>
> ==============================================================================
> [410649.811086] LNet: Service thread pid 3516 was inactive for 200.27s. The
> thread might be hung, or it might only be slow and will resume later. Dumping
> the stack trace for debugging purposes:
> [410649.811175] Pid: 3516, comm: mdt00_003
> [410649.811201]
> [410649.811202] Call Trace:
> [410649.811250] [<ffffffff81033625>] ? check_preempt_curr+0x75/0xa0
> [410649.811278] [<ffffffff8103366b>] ? ttwu_do_wakeup+0x1b/0xa0
> [410649.811306] [<ffffffff810376c1>] ? ttwu_do_activate.constprop.160+0x61/0x70
> [410649.811336] [<ffffffff8103c40a>] ? try_to_wake_up+0x1da/0x280
> [410649.811367] [<ffffffff814106ba>] schedule+0x3a/0x50
> [410649.811393] [<ffffffff81410a95>] schedule_timeout+0x145/0x210
> [410649.811421] [<ffffffff8104d090>] ? process_timeout+0x0/0x10
> [410649.811451] [<ffffffffa0b6dc88>] osp_precreate_reserve+0x328/0x8b0 [osp]
> [410649.811484] [<ffffffffa014f026>] ? do_get_write_access+0x396/0x4d0 [jbd2]
> [410649.811515] [<ffffffff81112a10>] ? __getblk+0x20/0x2e0
> [410649.811542] [<ffffffff8103c4b0>] ? default_wake_function+0x0/0x10
> [410649.811571] [<ffffffffa0b64759>] osp_declare_create+0x1a9/0x680 [osp]
> [410649.811603] [<ffffffffa0ab2a10>] lod_sub_declare_create+0xe0/0x270 [lod]
> [410649.811633] [<ffffffffa0aabdc7>] lod_qos_declare_object_on+0xc7/0x3d0 [lod]
> [410649.811664] [<ffffffffa0aab7fe>] ? lod_statfs_and_check+0xae/0x5b0 [lod]
> [410649.811694] [<ffffffffa0aacfd4>] lod_alloc_qos.constprop.10+0xe64/0x17b0 [lod]
> [410649.811741] [<ffffffffa01741b0>] ? ldiskfs_map_blocks+0x180/0x1e0 [ldiskfs]
> [410649.811772] [<ffffffffa0ab07ea>] lod_qos_prep_create+0x12ea/0x2910 [lod]
> [410649.811803] [<ffffffffa07c8c94>] ? qsd_op_begin+0x114/0x4d0 [lquota]
> [410649.811833] [<ffffffffa0ab23f0>] lod_prepare_create+0x2c0/0x410 [lod]
> [410649.811863] [<ffffffffa0aa7ccd>] lod_declare_striped_create+0x10d/0xa50 [lod]
> [410649.811908] [<ffffffffa0aaa9b9>] lod_declare_create+0x1e9/0x5a0 [lod]
> [410649.811938] [<ffffffffa0b1af26>] mdd_declare_create_object_internal+0x116/0x320 [mdd]
> [410649.811983] [<ffffffffa0b00c9c>] mdd_declare_create_object.isra.19+0x3c/0xbb0 [mdd]
> [410649.812028] [<ffffffffa0b00044>] ? mdd_linkea_prepare+0x294/0x590 [mdd]
> [410649.812058] [<ffffffffa0b0f90e>] mdd_create+0x88e/0x27d0 [mdd]
> [410649.812088] [<ffffffffa08173e0>] ? osd_xattr_get+0x80/0x890 [osd_ldiskfs]
> [410649.812120] [<ffffffffa09fd3ff>] mdt_reint_open+0x225f/0x3890 [mdt]
> [410649.812158] [<ffffffffa0431276>] ? null_alloc_rs+0x186/0x340 [ptlrpc]
> [410649.812191] [<ffffffffa02ea3fa>] ? upcall_cache_get_entry+0x29a/0x890 [obdclass]
> [410649.812237] [<ffffffffa02ef409>] ? lu_ucred+0x19/0x30 [obdclass]
> [410649.812267] [<ffffffffa09df7db>] ? ucred_set_jobid+0x5b/0x70 [mdt]
> [410649.812297] [<ffffffffa09f1810>] mdt_reint_rec+0xa0/0x210 [mdt]
> [410649.812326] [<ffffffffa09de91d>] mdt_reint_internal+0x63d/0xa50 [mdt]
> [410649.812356] [<ffffffffa09df07a>] mdt_intent_reint+0x21a/0x430 [mdt]
> [410649.812385] [<ffffffffa09da7ed>] mdt_intent_policy+0x5bd/0xde0 [mdt]
> [410649.812418] [<ffffffffa03aa257>] ldlm_lock_enqueue+0x3a7/0x9c0 [ptlrpc]
> [410649.812453] [<ffffffffa03d28e3>] ldlm_handle_enqueue0+0x9c3/0x1790 [ptlrpc]
> [410649.812490] [<ffffffffa04211c0>] ? req_capsule_client_get+0x10/0x20 [ptlrpc]
> [410649.812541] [<ffffffffa045d16c>] ? tgt_request_preprocess.isra.17+0x25c/0x1250 [ptlrpc]
> [410649.812593] [<ffffffffa044ee95>] ? tgt_lookup_reply+0x35/0x1c0 [ptlrpc]
> [410649.812629] [<ffffffffa045b74d>] tgt_enqueue+0x5d/0x250 [ptlrpc]
> [410649.812664] [<ffffffffa045ed1d>] tgt_request_handle+0x8ad/0x15a0 [ptlrpc]
> [410649.812701] [<ffffffffa03f84a4>] ? lustre_msg_get_transno+0x84/0x100 [ptlrpc]
> [410649.812752] [<ffffffffa04084e1>] ptlrpc_main+0x1051/0x2a40 [ptlrpc]
> [410649.812780] [<ffffffff8140ff44>] ? __schedule+0x294/0x940
> [410649.812815] [<ffffffffa0407490>] ? ptlrpc_main+0x0/0x2a40 [ptlrpc]
> [410649.812844] [<ffffffff8105c817>] kthread+0x87/0x90
> [410649.812870] [<ffffffff81413c34>] kernel_thread_helper+0x4/0x10
> [410649.812898] [<ffffffff8105c790>] ? kthread+0x0/0x90
> [410649.812923] [<ffffffff81413c30>] ? kernel_thread_helper+0x0/0x10
> [410649.812950]
> [410649.812970] LustreError: dumping log to /tmp/lustre-log.1553888141.3516
> [410649.818730] wanted to write 3985 but wrote 3518
> [410749.275270] LNet: Service thread pid 3516 completed after 299.99s. This
> indicates the system was overloaded (too many service threads, or there were
> not enough hardware resources).
> ==============================================================================
>
> Our lustre-2.10.6 was compiled against the sles11sp3 kernel
>
>     linux-3.0.101-138.gcdbe806
>
> using the ldiskfs backend. The hardware spec of our MDS is:
>
>     CPU: Intel Xeon E5640 @ 2.67GHz (single CPU)
>     RAM: 8 GB
>     MGS: 1 GB (under RAID 1)
>     MDT: 230 GB (under RAID 1)
>     RAID controller: LSI ServeRAID M1015 SAS/SATA Controller
>
> Is there any suggestion for fixing this problem?
>
> Thank you very much.
>
> T.H.Hsieh
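
One observation on the trace above: the service thread is parked in
osp_precreate_reserve(), i.e. the MDT is waiting to reserve precreated
objects on an OST before it can create a file. With 36 of the 40 OSTs set
to max_create_count=0, every create has to be satisfied by the 4 new OSTs,
so if their precreate pipeline stalls the whole filesystem appears to
freeze. A quick way to inspect the precreate state on the MDS (a sketch
assuming 2.10-style osp parameter names; verify with lctl list_param):

    # Untested sketch: show each OST's precreate health as seen by the
    # MDS. prealloc_status is 0 when healthy, a negative errno (e.g.
    # -28 = ENOSPC) when the OST is full or creates are disabled.
    lctl get_param osp.chome-OST*.prealloc_status \
                   osp.chome-OST*.create_count \
                   osp.chome-OST*.max_create_count
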
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org