As mentioned in the original email, I did follow the writeconf procedure as 
per the manual, but it didn't work.

Anyway, I've managed to fix it, to some extent at least.

LNet didn't have a tcp interface on one subnet on the OSSes, only the o2ib 
interface. I added the tcp one and mounting has started working.

It's a bit mystifying as all the parameters are set up using the o2ib subnet 
addresses and interfaces. I assume this means the traffic is all over ib, but 
if there's a good way to confirm that I'd welcome hearing it!
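
One rough check, sketched here in Python, is to tally which LNet network 
types appear in the NIDs the kernel logs mention. This is only a sketch under 
stated assumptions: the function name count_nid_networks and the regular 
expression are hypothetical helpers, not part of Lustre, and the result shows 
which NIDs are being logged, not how many bytes cross each fabric. On a live 
node, `lctl list_nids` lists the configured NIDs, and I believe 
`lnetctl net show -v` reports per-network statistics, though that is worth 
checking against the manual for 2.12.

```python
import re
from collections import Counter

# Matches IPv4-style LNet NIDs such as 10.3.255.200@o2ib or 10.1.0.5@tcp1.
# The lookbehind lets a NID follow a prefix like "MGC" while still refusing
# to start in the middle of a longer dotted number.
NID_RE = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})@([a-z0-9]+)")


def count_nid_networks(log_text):
    """Count how often each LNet network name appears among NIDs in log_text."""
    return Counter(net for _addr, net in NID_RE.findall(log_text))


if __name__ == "__main__":
    # Sample lines taken from the logs quoted in this thread.
    sample = (
        "Lustre: MGS: Connection restored to "
        "24758df3-a11a-f5db-18a5-2e0e35f2099d (at 10.3.255.199@o2ib)\n"
        "LustreError: 15c-8: MGC10.3.255.200@o2ib: The configuration from log "
        "'lustre-client' failed (-2)."
    )
    for net, count in count_nid_networks(sample).items():
        print(net, count)
```

If only o2ib shows up for the server-to-server connections, that is at least 
consistent with the traffic going over ib, though it is not proof.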

--
Dr Edd Edmondson
HPC Systems Manager
Dept of Physics and Astronomy
University College London

(he/him) During remote working email is the best way to contact me. If needed I 
am available by phone on 0203 108 1399, by Microsoft Teams, or other methods by 
arrangement.

On 18 Jan 2023 at 10:50 +0000, Edmondson, Edward via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi all,

I'm struggling to get my OSS mounts online after a less than clean shutdown. 
I'm on Lustre 2.12.9. Unfortunately, plenty of googling hasn't turned up 
anything specific to the problem I'm having.

LNet seems to be up, pings work both ways, and judging by the logs the nodes 
are clearly communicating. I've been through the log reconfiguration 
process with --writeconf on everything, step by step as in the manual.

On the OSS node when I try to mount:
mount.lustre: mount /dev/mapper/lustre-oss0 at /mnt/oss0 failed: No such file 
or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

In logs:
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
31015:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
31015:0:(lwp_dev.c:125:lwp_setup()) lustre-MDT0000-lwp-OST0000: client obd 
setup error: rc = -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
31015:0:(lwp_dev.c:273:lwp_init0()) lustre-MDT0000-lwp-OST0000: setup lwp 
failed. -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
31015:0:(obd_config.c:559:class_setup()) setup lustre-MDT0000-lwp-OST0000 
failed (-2)
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
31015:0:(obd_mount.c:202:lustre_start_simple()) lustre-MDT0000-lwp-OST0000 
setup error -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
31015:0:(obd_mount_server.c:671:lustre_lwp_setup()) lustre-MDT0000-lwp-OST0000: 
setup up failed: rc -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 15c-8: MGC10.3.255.200@o2ib: The 
configuration from log 'lustre-client' failed (-2). This may be the result of 
communication errors between this node and the MGS, a bad configuration, or 
other errors. See the syslog for more information.
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
30961:0:(obd_mount_server.c:1414:server_start_targets()) lustre-OST0000: failed 
to start LWP: -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 
30961:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: 
-2
Jan 18 10:27:56 nas-0-4 kernel: Lustre: Failing over lustre-OST0000
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 
30961:0:(ldlm_lockd.c:3203:ldlm_cleanup()) ldlm still has namespaces; clean 
these up first.
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 
30961:0:(ldlm_lockd.c:2862:ldlm_put_ref()) ldlm_cleanup failed: -16
Jan 18 10:27:57 nas-0-4 kernel: Lustre: server umount lustre-OST0000 complete
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 
30961:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount (-2)
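
For reading the return codes in the log above: kernel-style negative return 
codes are conventionally negated errno values, so a short Python sketch can 
decode them. The helper name decode_rc is hypothetical, and the mapping 
assumes it is run on Linux, where errno numbering matches the kernel's.

```python
import errno
import os


def decode_rc(rc):
    """Translate a negative kernel-style return code into (errno name, message)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The two return codes that appear in the OSS log excerpt.
for rc in (-2, -16):
    name, message = decode_rc(rc)
    print(rc, name, message)
```

On Linux, -2 decodes to ENOENT, which matches the "No such file or directory" 
text from mount.lustre, and -16 decodes to EBUSY.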

On the MGS/MDT node (which has now mounted the MGS and MDT fine):
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Connection restored to 
24758df3-a11a-f5db-18a5-2e0e35f2099d (at 10.3.255.199@o2ib)
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Regenerating lustre-OST0000 log by 
user request: rc = 0
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Found index 0 for lustre-OST0000, 
updating log
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Client log for lustre-OST0000 was not 
updated; writeconf the MDT first to regenerate it.

The MDT has absolutely been writeconfed so that last message isn't terribly 
helpful. fscks are clean, so there's not a problem there.

Any advice hugely appreciated!
