Mike,

It looks like the MDS server is having a problem contacting the MGS server. I'm guessing the MGS is a separate host? I would start by looking for possible network problems that might explain the LNet timeouts. You can try using "lctl ping" to test the LNet connection between nodes, and you can also try a regular "ping" between the IP addresses on the IB interfaces.
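For example, the checks could look something like this (this is just a sketch: it takes 172.16.100.4@o2ib from the timeout messages in your log and assumes that is the MGS NID; substitute the correct NID for your site):

```shell
# NID taken from the "Timed out tx for 172.16.100.4@o2ib" messages in your log;
# replace with your actual MGS NID ("lctl list_nids" on the MGS shows it).
MGS_NID="172.16.100.4@o2ib"

# LNet-level ping from the MDS to the MGS (needs the Lustre utilities installed)
if command -v lctl >/dev/null 2>&1; then
    lctl ping "$MGS_NID"
else
    echo "lctl not found; install the Lustre utilities first"
fi

# Plain IP ping over the IB interface: strip the "@o2ib" suffix to get the address
IB_IP="${MGS_NID%@*}"
ping -c 3 -W 2 "$IB_IP" || echo "no IP reachability to $IB_IP"
```

If the IP ping works but "lctl ping" fails, the problem is likely at the LNet/o2ib layer rather than basic IP connectivity; if both fail, look at the IB fabric itself.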
--Rick

On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of lustre-discuss@lists.lustre.org> wrote:

Greetings,

We have experienced some type of issue that is causing both of our MDS servers to be able to mount the MDT device only in read-only mode. Here are some of the error messages we are seeing in the log files below. We lost our Lustre expert a while back and we are not sure how to troubleshoot this issue. Can anybody provide us guidance on how to proceed?

Thanks,

Mike

Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more than 120 seconds.
Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre    D ffff9f27a3bc5230     0  4123      1 0x00000086
Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb585da9>] schedule+0x29/0x70
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb5838b1>] schedule_timeout+0x221/0x2d0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6b8e5>] ? tracing_is_on+0x15/0x30
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6f5bd>] ? tracing_record_cmdline+0x1d/0x120
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf77d9b>] ? probe_sched_wakeup+0x2b/0xa0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaed7d15>] ? ttwu_do_wakeup+0xb5/0xe0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb58615d>] wait_for_completion+0xfd/0x140
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaedb990>] ? wake_up_state+0x20/0x20
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f529a4>] llog_process_or_fork+0x244/0x450 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f52bc4>] llog_process+0x14/0x20 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f85d05>] class_config_parse_llog+0x125/0x350 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a69fc0>] mgc_process_cfg_log+0x790/0xc40 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6d4cc>] mgc_process_log+0x3dc/0x8f0 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6e15f>] ? config_recover_log_add+0x13f/0x280 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6eb2b>] mgc_process_config+0x88b/0x13f0 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f91b58>] lustre_process_log+0x2d8/0xad0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0e5a177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f7c8b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc08f4>] server_start_targets+0x13a4/0x2a20 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f94bb0>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc303c>] server_fill_super+0x10cc/0x1890 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f97a08>] lustre_fill_super+0x468/0x960 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f975a0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0510cf>] mount_nodev+0x4f/0xb0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8f9a8>] lustre_mount+0x38/0x60 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb051c4e>] mount_fs+0x3e/0x1b0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0707a7>] vfs_kern_mount+0x67/0x110
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb072edf>] do_mount+0x1ef/0xd00
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb049d7a>] ? __check_object_size+0x1ca/0x250
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0288ec>] ? kmem_cache_alloc_trace+0x3c/0x200
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb073d33>] SyS_mount+0x83/0xd0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb592ed2>] system_call_fastpath+0x25/0x2a
Jun 20 15:13:14 hyd-mds1 kernel: LNet: 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.16.100.4@o2ib: 9 seconds
Jun 20 15:13:14 hyd-mds1 kernel: LNet: 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous similar messages
Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more than 120 seconds.
Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre    D ffff9f27a3bc5230     0  4123      1 0x00000086

dumpe2fs seems to show that the file systems are clean, i.e.:
dumpe2fs 1.45.6.wc1 (20-Mar-2020)
Filesystem volume name:   hydra-MDT0000
Last mounted on:          /
Filesystem UUID:          3ae09231-7f2a-43b3-a4ee-7f36080b5a66
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2247671504
Block count:              1404931944
Reserved block count:     70246597
Free blocks:              807627552
Free inodes:              2100036536
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         20472
Fragments per group:      20472
Inodes per group:         32752
Inode blocks per group:   8188
Flex block group size:    16
Filesystem created:       Thu Aug  8 14:21:01 2019
Last mount time:          Tue Jun 20 15:19:03 2023
Last write time:          Wed Jun 21 10:43:51 2023
Mount count:              38
Maximum mount count:      -1
Last checked:             Thu Aug  8 14:21:01 2019
Check interval:           0 (<none>)
Lifetime writes:          219 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               1024
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      2e518531-82d9-4652-9acd-9cf9ca09c399
Journal backup:           inode blocks
MMP block number:         1851467
MMP update interval:      5
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             4096M
Journal length:           1048576
Journal sequence:         0x0a280713
Journal start:            0
MMP_block:
    mmp_magic: 0x4d4d50
    mmp_check_interval: 6
    mmp_sequence: 0xff4d4d50
    mmp_update_date: Wed Jun 21 10:43:51 2023
    mmp_update_time: 1687358631
    mmp_node_name: hyd-mds1.uncc.edu
    mmp_device_name: dm-0

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org