[lustre-discuss] OST server seems overloaded ?

Tung-Han Hsieh Sat, 04 Jul 2020 20:33:30 -0700

Dear All,

One of our Lustre OST servers keeps showing the following error
messages in dmesg:


==========================================================================
LNet: Service thread pid 51988 was inactive for 200.44s. Watchdog stack traces 
are limited to 3 per 300 seconds, skipping this one.
LNet: Service thread pid 63055 completed after 308.42s. This indicates the 
system was overloaded (too many service threads, or there were not enough 
hardware resources).
LNet: Service thread pid 55541 was inactive for 232.30s. The thread might be 
hung, or it might only be slow and will resume later. Dumping the stack trace 
for debugging purposes:
Pid: 55541, comm: ll_ost_io01_100 3.12.72 #7 SMP Sun Feb 10 17:06:08 CST 2019
Call Trace:
 [<ffffffffa312f1b5>] cv_wait_common+0x95/0x110 [spl]
 [<ffffffffa312f263>] __cv_wait_io+0x13/0x20 [spl]
 [<ffffffffa32ce9b3>] zio_wait+0x113/0x1b0 [zfs]
 [<ffffffffa32210ac>] dmu_buf_hold_array_by_dnode+0x14c/0x4d0 [zfs]
 [<ffffffffa3221494>] dmu_buf_hold_array_by_bonus+0x64/0x80 [zfs]
 [<ffffffffa0377e71>] osd_bufs_get+0x3d1/0xc80 [osd_zfs]
 [<ffffffffa05687dd>] ofd_preprw+0x7dd/0x2000 [ofd]
 [<ffffffffa01c5659>] tgt_brw_read+0x5c9/0x1fb0 [ptlrpc]
 [<ffffffffa01c34e2>] tgt_request_handle+0x762/0x15f0 [ptlrpc]
 [<ffffffffa016de6e>] ptlrpc_main+0xfbe/0x2b30 [ptlrpc]
 [<ffffffff810614fe>] kthread+0xce/0xe0
 [<ffffffff814cced8>] ret_from_fork+0x58/0x90
 [<ffffffffffffffff>] 0xffffffffffffffff
==========================================================================
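
To see how often these watchdog messages occur and how long the threads
stall, the dmesg output can be summarized with a short script. The
following is only a minimal sketch: the regular expressions are modelled
on the two message shapes quoted above, and the function name
summarize_watchdog and the SAMPLE text are made up for illustration,
not part of Lustre.

```python
import re

# Patterns modelled on the two watchdog message shapes quoted above.
INACTIVE = re.compile(r"Service thread pid (\d+) was inactive for ([\d.]+)s")
COMPLETED = re.compile(r"Service thread pid (\d+) completed after ([\d.]+)s")


def summarize_watchdog(log_text):
    """Count watchdog messages in log_text and report the longest durations."""
    inactive = [(int(pid), float(sec)) for pid, sec in INACTIVE.findall(log_text)]
    completed = [(int(pid), float(sec)) for pid, sec in COMPLETED.findall(log_text)]
    return {
        "inactive_count": len(inactive),
        "completed_count": len(completed),
        "max_inactive_s": max((sec for _, sec in inactive), default=0.0),
        "max_completed_s": max((sec for _, sec in completed), default=0.0),
        "pids": sorted({pid for pid, _ in inactive + completed}),
    }


# Hypothetical sample built from the first lines of the messages above.
SAMPLE = """\
LNet: Service thread pid 51988 was inactive for 200.44s. Watchdog stack traces
LNet: Service thread pid 63055 completed after 308.42s. This indicates the
LNet: Service thread pid 55541 was inactive for 232.30s. The thread might be
"""

if __name__ == "__main__":
    for key, value in summarize_watchdog(SAMPLE).items():
        print(f"{key}: {value}")
```

On a real server the same function could be given the saved dmesg text
instead of SAMPLE.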

This OST server runs Lustre-2.10.7 with a ZFS backend. It is connected
to an external storage unit through one 8G/s fiber link. The external
storage is an Infortrend DS1016 with 24 bays, configured as RAID6 + 1
hot spare. It holds a single 113TB partition formatted with ZFS. The
OST server serves 44 computing nodes, each with 12 - 32 cores, which
are usually fully loaded. The OST server has the following hardware spec:

- CPU: Intel Xeon Silver 4214, 2.2GHz, dual CPU, 24 cores in total.
- RAM: 128GB
- InfiniBand FDR for internal cluster communication.

Every computing node and the MDT server is also connected to the
InfiniBand network.
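
For a rough sense of scale, the sketch below divides the storage link
bandwidth among the clients. It rests on an assumption that is not
confirmed above: that "8G/s fiber" means an 8 Gbit/s Fibre Channel
link with a nominal throughput of about 800 MB/s per direction. The
node and core counts are taken from the description; the function name
is made up for illustration, and none of the figures are measurements.

```python
# ASSUMPTION: "8G/s fiber" is 8 Gbit/s Fibre Channel, nominally ~800 MB/s
# per direction. This is a nominal figure, not a measurement.
FC_LINK_MB_S = 800.0

CLIENT_NODES = 44          # from the description above
CORES_PER_NODE = (12, 32)  # from the description above (min, max)


def share_mb_s(link_mb_s, consumers):
    """Even split of the link bandwidth among a number of consumers."""
    return link_mb_s / consumers


if __name__ == "__main__":
    per_node = share_mb_s(FC_LINK_MB_S, CLIENT_NODES)
    print(f"per node: {per_node:.1f} MB/s")
    for cores in CORES_PER_NODE:
        total_cores = CLIENT_NODES * cores
        per_core = share_mb_s(FC_LINK_MB_S, total_cores)
        print(f"{total_cores} cores busy: {per_core:.2f} MB/s per core")
```

This only bounds what the single storage link could deliver if every
node did I/O at the same time; it says nothing about the RAID6 array
or ZFS itself.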

We are wondering whether this OST server, together with the external
storage, is really overloaded with this hardware configuration. If so,
what could we do to improve the situation?

Thank you very much for your kind suggestions.

Best Regards,

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
