I’ve tried the writeconf (and fixed the servicenode parameters at the same time), but that didn’t seem to work either.  The steps I took were (see the command sketch after the list):

- unmount all clients
- unmount all servers (MDT and all OSTs)
- run the tunefs.lustre commands (see detail below)
- remount the MDT on its primary node
- remount the OSTs (all on their primary OSSes)
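
A rough sketch of that sequence as commands, for clarity (the client mount point here is a placeholder, not our actual path):

       # on every client
       umount /mnt/hpfs-fsl

       # on the MDS and every OSS, via the init.d scripts
       service lustre stop

       # rewrite the config logs on each target (full commands below)
       # tunefs.lustre --writeconf --erase-param --servicenode=... <dataset>

       # remount, MDT first and then the OSTs, each on its primary node
       service lustre start local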

Everything looked good after this: the system logs indicate that the Lustre 
logs were regenerated and I could mount a client.  So I failed over to the 
secondary MDS, but I’m still getting the ptlrpc_expire_one_request messages on 
the OSTs after failover, and all the other info (‘lctl dl’ output and 
/proc/fs/lustre/mgc) indicates the MGC on each OSS is still pointed at the 
primary MGS node.

One other thought comes to mind.  We are using the init.d scripts (i.e. 
/etc/init.d/{lustre,lnet}) and /etc/ldev.conf.  We have lnet chkconfig’ed on, 
so lnet starts at boot on all servers.  But ‘lustre’ is chkconfig’ed off, so 
that if a server reboots for whatever reason we don’t get into a situation 
where we multi-mount.  On a clean boot we have to manually mount the MDT/OSTs 
(i.e. do a “service lustre start”).  To fail over, we do “/etc/init.d/lustre 
stop local” on the primary and “/etc/init.d/lustre start foreign” on the 
secondary.  What is the right thing to do with lnet on failover?  Should it be 
stopped on the primary node before failing over to the secondary?  This is the 
state of the primary MDS after stopping Lustre:

[root@hpfs-fsl-mds0 ~]# cd /etc/init.d
[root@hpfs-fsl-mds0 init.d]# ./lustre stop
Unmounting /mnt/lustre/local/hpfs-fsl-MDT0000
[root@hpfs-fsl-mds0 lustre]# service lustre status
partial 
[root@hpfs-fsl-mds0 lustre]# service lnet status
running
[root@hpfs-fsl-mds0 lustre]# lsmod | grep lnet
lnet                  449065  8 mdt,mgc,mgs,ko2iblnd,obdclass,ptlrpc,ksocklnd
libcfs                405310  16 
fid,fld,lod,mdd,mdt,mgc,mgs,osp,lnet,lfsck,ko2iblnd,lquota,obdclass,ptlrpc,osd_zfs,ksocklnd
[root@hpfs-fsl-mds0 lustre]#
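
For reference, our /etc/ldev.conf pairs each target with its failover host in 
the usual "local foreign label device" format, roughly like this (the entries 
here are illustrative, not copied verbatim):

       hpfs-fsl-mds0  hpfs-fsl-mds1  hpfs-fsl-MDT0000  zfs:metadata/meta-fsl
       hpfs-fsl-oss00 hpfs-fsl-oss01 hpfs-fsl-OST0000  zfs:oss00-0/ost-fsl

As I understand it, "stop local" unmounts the targets whose local host matches 
the node, and "start foreign" mounts the ones where the node is listed as the 
foreign (failover) host.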

Any other help figuring out what we are missing would be much appreciated.  
More detail below.  

Darby


tunefs.lustre command on the MDS (fix service nodes and do a writeconf):

       tunefs.lustre \
           --verbose \
           --force-nohostid \
           --writeconf \
           --erase-param \
           --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
           --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
           $pool/meta-fsl
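
To sanity-check what got written, something like the following should print 
the stored parameters without changing anything (I believe --dryrun still 
wants --force-nohostid on our ZFS targets):

       tunefs.lustre --dryrun --force-nohostid $pool/meta-fsl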


tunefs.lustre command on all OSSes (fix service nodes and do a writeconf):

       tunefs.lustre \
           --verbose \
           --force-nohostid \
           --writeconf \
           --erase-param \
           --mgsnode=xxx.xxx.98.30@tcp0,xxx.xxx.0.30@o2ib0 \
           --mgsnode=xxx.xxx.98.31@tcp0,xxx.xxx.0.31@o2ib0 \
           --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
           --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
           $pool/ost-fsl
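
And to confirm the settings landed in the ZFS properties on each OST, the same 
kind of check as the "zfs get" output quoted further below:

       zfs get all $pool/ost-fsl | grep lustre: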


Primary MDS log output after the tunefs and remounting:


Jan 10 08:54:25 hpfs-fsl-mds0 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 10 08:54:26 hpfs-fsl-mds0 kernel: Lustre: MGS: Connection restored to 
683235c6-0848-7f44-7cec-6c4bc0897c99 (at 0@lo)
Jan 10 08:54:26 hpfs-fsl-mds0 kernel: Lustre: MGS: Logs for fs hpfs-fsl were 
removed by user request.  All servers must be restarted in order to regenerate 
the logs.
Jan 10 08:54:26 hpfs-fsl-mds0 kernel: Lustre: hpfs-fsl-MDT0000: Imperative 
Recovery not enabled, recovery window 300-900
Jan 10 08:54:51 hpfs-fsl-mds0 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to 683235c6-0848-7f44-7cec-6c4bc0897c99 (at 0@lo)
Jan 10 08:55:14 hpfs-fsl-mds0 kernel: Lustre: MGS: Connection restored to 
75fa2ba9-749d-e00f-84d3-e4e9b8753be3 (at xxx.xxx.98.32@tcp)
Jan 10 08:55:14 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0000 log by user request.
Jan 10 08:55:22 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0000-osc-MDT0000: operation ost_connect to node xxx.xxx.98.32@tcp 
failed: rc = -114
Jan 10 08:55:25 hpfs-fsl-mds0 kernel: Lustre: MGS: Connection restored to 
xxx.xxx.98.33@tcp (at xxx.xxx.98.33@tcp)
Jan 10 08:55:25 hpfs-fsl-mds0 kernel: Lustre: Skipped 1 previous similar message
Jan 10 08:55:25 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0001 log by user request.
Jan 10 08:55:30 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0001-osc-MDT0000: operation ost_connect to node xxx.xxx.98.33@tcp 
failed: rc = -114
Jan 10 08:56:20 hpfs-fsl-mds0 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: 
Connection restored to xxx.xxx.98.32@tcp (at xxx.xxx.98.32@tcp)
Jan 10 08:56:20 hpfs-fsl-mds0 kernel: Lustre: Skipped 1 previous similar message
Jan 10 08:57:03 hpfs-fsl-mds0 systemd: Starting Cleanup of Temporary 
Directories...
Jan 10 08:57:03 hpfs-fsl-mds0 systemd: Started Cleanup of Temporary Directories.
Jan 10 08:57:10 hpfs-fsl-mds0 kernel: Lustre: MGS: Connection restored to 
a51f056c-ea55-e0e4-8169-e3fa81087ffe (at xxx.xxx.98.34@tcp)
Jan 10 08:57:10 hpfs-fsl-mds0 kernel: Lustre: Skipped 1 previous similar message
Jan 10 08:57:10 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0002 log by user request.
Jan 10 08:57:18 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0002-osc-MDT0000: operation ost_connect to node xxx.xxx.98.34@tcp 
failed: rc = -114
Jan 10 08:57:20 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0003 log by user request.
Jan 10 08:57:25 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0003-osc-MDT0000: operation ost_connect to node xxx.xxx.98.35@tcp 
failed: rc = -114
Jan 10 08:57:30 hpfs-fsl-mds0 kernel: Lustre: MGS: Connection restored to 
df6d6bea-025d-30ac-6577-5101be3a9c95 (at xxx.xxx.98.36@tcp)
Jan 10 08:57:30 hpfs-fsl-mds0 kernel: Lustre: Skipped 3 previous similar 
messages
Jan 10 08:57:30 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0004 log by user request.
Jan 10 08:57:31 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0004-osc-MDT0000: operation ost_connect to node xxx.xxx.98.36@tcp 
failed: rc = -114
Jan 10 08:57:41 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0005 log by user request.
Jan 10 08:57:45 hpfs-fsl-mds0 ntpd[5276]: 0.0.0.0 c612 02 freq_set kernel 
-26.888 PPM
Jan 10 08:57:45 hpfs-fsl-mds0 ntpd[5276]: 0.0.0.0 c615 05 clock_sync
Jan 10 08:57:46 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0005-osc-MDT0000: operation ost_connect to node xxx.xxx.98.37@tcp 
failed: rc = -114
Jan 10 08:58:01 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST0007 log by user request.
Jan 10 08:58:01 hpfs-fsl-mds0 kernel: Lustre: Skipped 1 previous similar message
Jan 10 08:58:03 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST0007-osc-MDT0000: operation ost_connect to node xxx.xxx.98.39@tcp 
failed: rc = -114
Jan 10 08:58:03 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar 
message
Jan 10 08:58:12 hpfs-fsl-mds0 kernel: Lustre: MGS: Connection restored to 
4c42c494-3fcc-322c-6b33-f524879b4e15 (at xxx.xxx.98.40@tcp)
Jan 10 08:58:12 hpfs-fsl-mds0 kernel: Lustre: Skipped 7 previous similar 
messages
Jan 10 08:58:33 hpfs-fsl-mds0 kernel: Lustre: MGS: Regenerating 
hpfs-fsl-OST000a log by user request.
Jan 10 08:58:33 hpfs-fsl-mds0 kernel: Lustre: Skipped 2 previous similar 
messages
Jan 10 08:58:37 hpfs-fsl-mds0 kernel: LustreError: 11-0: 
hpfs-fsl-OST000a-osc-MDT0000: operation ost_connect to node xxx.xxx.98.42@tcp 
failed: rc = -114
Jan 10 08:58:37 hpfs-fsl-mds0 kernel: LustreError: Skipped 2 previous similar 
messages
Jan 10 08:59:33 hpfs-fsl-mds0 kernel: Lustre: hpfs-fsl-OST0009-osc-MDT0000: 
Connection restored to xxx.xxx.98.41@tcp (at xxx.xxx.98.41@tcp)
Jan 10 08:59:33 hpfs-fsl-mds0 kernel: Lustre: Skipped 14 previous similar 
messages



I mounted a client and everything looked good.  So I failed over to the 
secondary MDS.  
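
Concretely, the failover used the init scripts as described above:

       # on the primary MDS (hpfs-fsl-mds0)
       service lustre stop local

       # on the secondary MDS (hpfs-fsl-mds1)
       service lustre start foreign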


Logs on failover MDS:


Jan 10 09:14:28 hpfs-fsl-mds1 kernel: SPL: using hostid 0x00000000
Jan 10 09:14:29 hpfs-fsl-mds1 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 10 09:14:29 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to 
72f30ef1-4359-ed17-9d56-e5126bc8b550 (at 0@lo)
Jan 10 09:14:29 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Imperative 
Recovery not enabled, recovery window 300-900
Jan 10 09:14:34 hpfs-fsl-mds1 kernel: Lustre: 
5396:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061269/real 1484061269]  req@ffff881013f1a100 
x1556151029203648/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484061274 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:14:42 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to xxx.xxx.98.33@tcp (at xxx.xxx.98.33@tcp)
Jan 10 09:14:47 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to xxx.xxx.98.41@tcp (at xxx.xxx.98.41@tcp)
Jan 10 09:14:47 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 10 09:14:53 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to xxx.xxx.98.37@tcp (at xxx.xxx.98.37@tcp)
Jan 10 09:14:53 hpfs-fsl-mds1 kernel: Lustre: Skipped 2 previous similar 
messages
Jan 10 09:14:57 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Will be in 
recovery for at least 5:00, or until 1 client reconnects
Jan 10 09:14:57 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Recovery over 
after 0:01, of 1 clients 1 recovered and 0 were evicted.
Jan 10 09:14:57 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to xxx.xxx.98.35@tcp (at xxx.xxx.98.35@tcp)
Jan 10 09:14:57 hpfs-fsl-mds1 kernel: Lustre: Skipped 5 previous similar 
messages




Logs on one of the OSSes after failover:

Jan 10 09:14:29 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection 
restored to hpfs-fsl-MDT0000-mdtlov_UUID (at xxx.xxx.98.31@tcp)
Jan 10 09:14:32 hpfs-fsl-oss00 kernel: Lustre: 
17094:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061265/real 1484061265]  req@ffff88101f486c00 
x1556149818101344/t0(0) 
o400->[email protected]@tcp:12/10 lens 224/224 e 0 to 
1 dl 1484061272 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:14:32 hpfs-fsl-oss00 kernel: Lustre: 
17093:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061265/real 1484061265]  req@ffff88101f486900 
x1556149818101328/t0(0) o400->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 224/224 e 0 to 1 dl 1484061272 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:14:32 hpfs-fsl-oss00 kernel: LustreError: 166-1: 
MGCxxx.xxx.98.30@tcp: Connection to MGS (at xxx.xxx.98.30@tcp) was lost; in 
progress operations using this service will fail
Jan 10 09:14:32 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: 
Connection to hpfs-fsl-MDT0000 (at xxx.xxx.98.30@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
Jan 10 09:14:38 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061272/real 1484061272]  req@ffff88101f487200 
x1556149818101376/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061278 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:14:57 hpfs-fsl-oss00 kernel: LustreError: 167-0: 
hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in 
progress operations using this service will fail.
Jan 10 09:14:57 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: 
Connection restored to xxx.xxx.98.31@tcp (at xxx.xxx.98.31@tcp)
Jan 10 09:14:57 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting 
orphan objects from 0x0:16904081 to 0x0:16904257
Jan 10 09:15:08 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061297/real 1484061297]  req@ffff88101f487500 
x1556149818101392/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061308 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:15:08 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar 
message
Jan 10 09:15:38 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061322/real 1484061322]  req@ffff88101f487b00 
x1556149818101424/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061338 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:16:08 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061347/real 1484061347]  req@ffff88101f487800 
x1556149818101456/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061368 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:16:38 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061372/real 1484061372]  req@ffff88101f487200 
x1556149818101488/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061398 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:17:33 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061422/real 1484061422]  req@ffff88101f486600 
x1556149818101536/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061453 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 10 09:18:28 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484061472/real 1484061472]  req@ffff88101f485d00 
x1556149818101584/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1484061508 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1


‘lctl dl’ and /proc/fs/lustre/mgc output on oss00 after the failover, still 
pointing to the primary MGS:


[root@hpfs-fsl-oss00 ~]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGCxxx.xxx.98.30@tcp 75fa2ba9-749d-e00f-84d3-e4e9b8753be3 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root@hpfs-fsl-oss00 ~]# ls /proc/fs/lustre/mgc/
MGCxxx.xxx.98.30@tcp
[root@hpfs-fsl-oss00 ~]#
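
For completeness, the MGC import state can also be dumped directly; if I’m 
reading the proc interface right, this shows the connection the MGC is 
currently using and the failover NIDs it knows about:

       lctl get_param mgc.*.import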

-----Original Message-----
From: "Mohr Jr, Richard Frank (Rick Mohr)" <[email protected]>
Date: Monday, January 9, 2017 at 9:22 AM
To: Darby Vicker <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [lustre-discuss] MGS failover problem

    Have you tried performing a writeconf to regenerate the lustre config log files?  This can sometimes fix the problem by making sure that everything is consistent.  (A writeconf is often required when making NID or failover changes.)  I think you could also use that opportunity to correct your --servicenode options if you wanted.
    
    --
    Rick Mohr
    Senior HPC System Administrator
    National Institute for Computational Sciences
    http://www.nics.tennessee.edu
    
    
    > On Jan 8, 2017, at 11:58 PM, Vicker, Darby (JSC-EG311) <[email protected]> wrote:
    > 
    > We have a new set of hardware we are configuring as a Lustre file system.  We are having a problem with MGS failover and could use some help.  It was formatted originally using 2.8 but we have since upgraded to 2.9.  We are using a JBOD with server pairs for failover and ZFS as the backend.  All servers are dual-homed on both Ethernet and IB.  The combined MGS/MDS is at X.X.X.30 (or .31 for the failover node) and the MDT was formatted as:
    > 
    > 
    >     mkfs.lustre \
    >         --fsname=hpfs-fsl \
    >         --backfstype=zfs \
    >         --reformat \
    >         --verbose \
    >         --mgs --mdt --index=0 \
    >         --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0 --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0 \
    >         --servicenode=${LUSTRE_LOCAL_IB_IP}@o2ib0 --servicenode=${LUSTRE_PEER_IB_IP}@o2ib0 \
    >         metadata/meta-fsl
    > 
    > 
    > And the OSTs were formatted as:
    > 
    >        mkfs.lustre \
    >            --mgsnode=xxx.xxx.98.30@tcp0,xxx.xxx.0.30@o2ib0 \
    >            --fsname=hpfs-fsl \
    >            --backfstype=zfs \
    >            --reformat \
    >            --verbose \
    >            --ost --index=$num \
    >            --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0 --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0 \
    >            --servicenode=${LUSTRE_LOCAL_IB_IP}@o2ib0 --servicenode=${LUSTRE_PEER_IB_IP}@o2ib0 \
    >            $pool/ost-fsl
    > 
    > 
    > 
    > We realize now there are a couple of mistakes in the above.  First, it would have been better to put the tcp0/o2ib0 pairs in the same --servicenode option as a comma-separated list (for both the MDT and the OSTs).  Our clients are only on one of the networks, so I don’t think that is a big problem though.  The second (bigger) problem is that we left out the failover MGS node in the mkfs.lustre command when the OSTs were formatted.  To correct this we used the following:
    > 
    >       tunefs.lustre \
    >           --verbose \
    >           --force-nohostid \
    >           --mgsnode=xxx.xxx.98.31@tcp0,xxx.xxx.0.31@o2ib0 \
    >           $pool/ost-fsl
    > 
    > 
    > I think it worked, since before the tunefs.lustre command a “zfs get all | grep mgs” showed this:
    > 
    > oss00-0/ost-fsl             lustre:mgsnode        xxx.xxx.98.30@tcp,xxx.xxx.0.30@o2ib  local
    > 
    > And afterward it shows this:
    > 
    > oss00-0/ost-fsl             lustre:mgsnode        xxx.xxx.98.30@tcp,xxx.xxx.0.30@o2ib:xxx.xxx.98.31@tcp,xxx.xxx.0.31@o2ib  local
    > 
    > 
    > OST failover seems to work great: clients pick up again with no problems and the logs on the servers don’t report any issues.  The MDT/MGS failover doesn’t go as well.  The clients seem to do just fine, but the OSS logs start reporting this:
    > 
    > Jan  4 11:24:42 hpfs-fsl-oss00 kernel: Lustre: 
15713:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1483550635/real 1483550635]  req@ffff8807dc9f6300 
x1555089580422192/t0(0) o250->MGCxxx.xxx.98.30@[email protected]@tcp:26/25 
lens 520/544 e 0 to 1 dl 1483550681 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
    > 
    > 
    > And ‘lctl dl’ on an OSS continues to show the primary MGC connection:
    > 
    > 
    > 
    > [root@hpfs-fsl-oss00 ~]# lctl dl
    >  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
    >  1 UP mgc MGCxxx.xxx.98.30@tcp 6832efc6-4cc6-cd22-9d48-f7bc31d8930c 5
    >  2 UP ost OSS OSS_uuid 3
    >  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 27
    >  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
    > [root@hpfs-fsl-oss00 ~]#
    > 
    > 
    > 
    > 
    > 
    > We have transferred a lot of data to this LFS in preparation for going into production, so we’d like to avoid reformatting it if possible, but that is an option if needed.  Are we still missing something from the initial mkfs.lustre missteps, or is there something else we are missing?
    > 
    > Thanks
    > Darby
    > 
    > 
    > 
    > _______________________________________________
    > lustre-discuss mailing list
    > [email protected]
    > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
    
    
    
    
