Hi folks,

In an effort to replicate a production environment for a test upgrade, I've 
created a six-server KVM testbed on a CentOS 7.4 host with CentOS 6 guests.  
I have four OSSs and two MDSs.  The qcow2 virtual disks are visible to the 
servers in pairs: each OSS has two OSTs and can also mount its paired server's 
two OSTs.  I have separate MGT and MDT volumes, again both visible and 
mountable by either MDS.  When I unmount an OST from one of the OSSs and try 
to mount it on what will be its HA partner (I'm failing over manually for now; 
once that works I'll install Corosync and Pacemaker), the second guest to 
mount the OST *occasionally* fails as follows:


[root@fakeoss4 ~]# mount /mnt/OST7
mount.lustre: increased /sys/block/vde/queue/max_sectors_kb from 1024 to 
2147483647
mount.lustre: mount /dev/vde at /mnt/OST7 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

And, from /var/log/messages:

Nov 29 10:55:33 fakeoss4 kernel: LDISKFS-fs (vdd): mounted filesystem with 
ordered data mode. quota=on. Opts:
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(llog_osd.c:236:llog_osd_read_header()) fake-OST0006-osd: bad log 
fake-OST0006 [0xa:0x10:0x0] header magic: 0x0 (expected 0x10645539)
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(mgc_request.c:1739:mgc_llog_local_copy()) MGC192.168.122.5@tcp: failed 
to copy remote log fake-OST0006: rc = -5
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 13a-8: Failed to get MGS log 
fake-OST0006 and no local copy.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 15c-8: MGC192.168.122.5@tcp: The 
configuration from log 'fake-OST0006' failed (-2). This may be the result of 
communication errors between this node and the MGS, a bad configuration, or 
other errors. See the syslog for more information.
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server 
fake-OST0006: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1783:server_fill_super()) Unable to start targets: -2
Nov 29 10:55:33 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount_server.c:1498:server_put_super()) no obd fake-OST0006
Nov 29 10:55:34 fakeoss4 kernel: Lustre: server umount fake-OST0006 complete
Nov 29 10:55:34 fakeoss4 kernel: LustreError: 
2326:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount  (-2)
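
For context, the manual failover itself is nothing fancier than this (a 
sketch; the device and mount-point names are assumptions based on the failing 
mount above and each OSS's fstab):

# sketch only: device and mount point are assumed
[root@fakeoss3 ~]# umount /mnt/OST7                    # stop serving the OST here
[root@fakeoss4 ~]# mount -t lustre /dev/vde /mnt/OST7  # take it over on the HA partner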

The OSS that fails to mount can see the MGS in question:

[root@fakeoss4 ~]# lctl ping 192.168.122.5
12345-0@lo
12345-192.168.122.5@tcp
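
I can also dump what is written on the failing target without modifying it; a 
sketch of that check (with /dev/vde assumed from the mount attempt above), 
which prints the stored target name and parameters such as the mgsnode and 
failover NIDs:

# /dev/vde is assumed; --dryrun only reads and prints, it changes nothing
[root@fakeoss4 ~]# tunefs.lustre --dryrun /dev/vde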


The environment was built as follows: a guest VM was installed from CentOS 6.5 
install media.  The kernel was then updated to 2.6.32-504.8.1.el6_lustre.x86_64 
from the Intel repos, and the Intel binary RPMs for Lustre were installed.  
"exclude=kernel*" was added to the repo files in /etc/yum.repos.d/ and a 
"yum update" was run, so it's an up-to-date system apart from the locked-down 
kernel.  e2fsprogs-1.42.12.wc1-7.el6.x86_64 is the installed version.  The VM 
was then cloned to make the six Lustre servers, and the filesystems were 
created with the following options:


[root@fakemds1 ~]# mkfs.lustre --fsname=fake --mgs 
--servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 /dev/vdb

[root@fakemds1 ~]# mkfs.lustre --reformat --fsname=fake --mdt --index=0 
--servicenode=192.168.122.5@tcp0 --servicenode=192.168.122.67@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdc


[root@fakeoss1 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=0 
--servicenode=192.168.122.197@tcp0 --servicenode=192.168.122.238@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb  # repeated for 3 
more OSTs with the index and device changed appropriately


[root@fakeoss3 ~]# mkfs.lustre --reformat --fsname=fake --ost --index=4 
--servicenode=192.168.122.97@tcp0 --servicenode=192.168.122.221@tcp0 
--mgsnode=192.168.122.5@tcp0:192.168.122.67@tcp0 /dev/vdb  # repeated for 3 
more OSTs with the index and device changed appropriately
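
The short "mount /mnt/OST7" form used above relies on fstab entries on each 
OSS for both its own OSTs and its partner's; a sketch of what those entries 
look like on fakeoss4 (devices and options here are assumptions, with noauto 
so both nodes don't grab the targets at boot):

# sketch of /etc/fstab on fakeoss4; devices and options are assumptions
/dev/vdd   /mnt/OST6   lustre   noauto,_netdev   0 0
/dev/vde   /mnt/OST7   lustre   noauto,_netdev   0 0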


The virtual disks were set as shareable and made visible to their proper VMs.  
The OSTs do sometimes mount on the failover node, but they fail as above more 
than half the time.  Have I missed any important information that could point 
to the cause?
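
For completeness, the shared disks were attached to the guests roughly along 
these lines (a sketch only; the host name, image path and cache mode are 
assumptions about my setup):

# sketch: host name, image path and cache mode are assumptions from my setup
[root@kvmhost ~]# virsh attach-disk fakeoss4 \
    /var/lib/libvirt/images/fake-ost7.qcow2 vde \
    --driver qemu --subdriver qcow2 --cache none \
    --mode shareable --persistent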


Once I get this VM environment stable, I intend to update it to Lustre 2.10.1.  
Thanks in advance for any troubleshooting tips you can provide.


Cheers

Scott
