Dear All,

One of our Lustre OST servers continuously shows the following error messages in dmesg:
==========================================================================
LNet: Service thread pid 51988 was inactive for 200.44s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
LNet: Service thread pid 63055 completed after 308.42s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Service thread pid 55541 was inactive for 232.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 55541, comm: ll_ost_io01_100 3.12.72 #7 SMP Sun Feb 10 17:06:08 CST 2019
Call Trace:
 [<ffffffffa312f1b5>] cv_wait_common+0x95/0x110 [spl]
 [<ffffffffa312f263>] __cv_wait_io+0x13/0x20 [spl]
 [<ffffffffa32ce9b3>] zio_wait+0x113/0x1b0 [zfs]
 [<ffffffffa32210ac>] dmu_buf_hold_array_by_dnode+0x14c/0x4d0 [zfs]
 [<ffffffffa3221494>] dmu_buf_hold_array_by_bonus+0x64/0x80 [zfs]
 [<ffffffffa0377e71>] osd_bufs_get+0x3d1/0xc80 [osd_zfs]
 [<ffffffffa05687dd>] ofd_preprw+0x7dd/0x2000 [ofd]
 [<ffffffffa01c5659>] tgt_brw_read+0x5c9/0x1fb0 [ptlrpc]
 [<ffffffffa01c34e2>] tgt_request_handle+0x762/0x15f0 [ptlrpc]
 [<ffffffffa016de6e>] ptlrpc_main+0xfbe/0x2b30 [ptlrpc]
 [<ffffffff810614fe>] kthread+0xce/0xe0
 [<ffffffff814cced8>] ret_from_fork+0x58/0x90
 [<ffffffffffffffff>] 0xffffffffffffffff
==========================================================================

This OST server runs Lustre-2.10.7 with a ZFS backend. It is connected to an external storage unit through one 8G/s fiber link. The external storage is an Infortrend DS1016 with 24 bays, configured as RAID6 + 1 hot spare. It holds a single 113TB partition formatted with ZFS. The OST server serves 44 computing nodes; each node has 12 - 32 cores and is usually fully loaded.

The OST server has the following hardware spec:
- CPU: Intel Xeon Silver 4214, 2.2GHz, dual CPU, 24 cores in total.
- RAM: 128GB
- Infiniband FDR for internal cluster communication.

Every computing node and the MDT server also has an Infiniband network connection.

We are wondering whether the hardware configuration of this OST server plus the external storage is really overloaded. If so, what else could we do to improve it?

Thank you very much for your kind suggestions.

Best Regards,

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org