I just tried out this configuration on 2.12.2 and was able to reproduce
what Scott saw.
I couldn't see a Jira ticket for this though, so I've opened a new
one: https://jira.whamcloud.com/browse/LU-12506
Cheers,
--
Matt Rásó-Barnett
University of Cambridge
On Wed, May 22, 2019 at 08:02:59AM +0000, Andreas Dilger wrote:
Scott, if you haven't already done so, it is probably best to file a
ticket in Jira with the details. Please include the client
syslog/dmesg as well as a Lustre debug log ("lctl dk /tmp/debug") so
that the problem can be isolated.
During DNE development we tested with up to 128 MDTs in AWS, but
haven't tested that many MDTs in some time.
Cheers, Andreas
On May 8, 2019, at 12:28, White, Scott F <sfpwh...@lanl.gov> wrote:
We’ve been testing DNE Phase II and tried scaling the number of
MDSes (one MDT each for all of our tests) very high, but when we did
that, we couldn’t mount the filesystem on a client. After trial and
error, we discovered that we were unable to mount the filesystem when
there were 56 MDSes. 55 MDSes mounted without issue, and it appears
any number below that will mount. This failure at 56 MDSes was
replicable across different nodes being used for the MDSes, all of
which were tested with working configurations, so it doesn’t seem to
be a bad server.
Here’s the error info we saw in dmesg on the client:
LustreError: 28880:0:(obd_config.c:559:class_setup()) setup lustre-MDT0037-mdc-ffff95923d31b000 failed (-16)
LustreError: 28880:0:(obd_config.c:1836:class_config_llog_handler()) MGCx.x.x.x@o2ib: cfg command failed: rc = -16
Lustre: cmd=cf003 0:lustre-MDT0037-mdc 1:lustre-MDT0037_UUID 2:x.x.x.x@o2ib
LustreError: 15c-8: MGCx.x.x.x@o2ib: The configuration from log 'lustre-client' failed (-16). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 28858:0:(obd_config.c:610:class_cleanup()) Device 58 not setup
Lustre: Unmounted lustre-client
LustreError: 28858:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-16)
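
For reference when reading the log above: Lustre return codes are
negative Linux errno values, and the four digits in a target name such
as lustre-MDT0037 are the target index in hexadecimal. The short Python
sketch below decodes both. The helper names mdt_index and errno_name are
hypothetical, made up for this illustration, and are not part of any
Lustre tool.

```python
import errno
import re


def mdt_index(device_name: str) -> int:
    """Return the zero-based MDT index encoded in a Lustre device name.

    Hypothetical helper for illustration: it only looks for the
    'MDTxxxx' part of the name, where xxxx is four hex digits.
    """
    match = re.search(r"MDT([0-9a-fA-F]{4})", device_name)
    if match is None:
        raise ValueError(f"no MDT index found in {device_name!r}")
    return int(match.group(1), 16)


def errno_name(rc: int) -> str:
    """Map a negative kernel-style return code to its errno symbol."""
    return errno.errorcode.get(-rc, "UNKNOWN")


if __name__ == "__main__":
    device = "lustre-MDT0037-mdc-ffff95923d31b000"
    index = mdt_index(device)
    # Indices start at zero, so index N is the (N + 1)th MDT.
    print(f"{device}: index {index}, MDT number {index + 1}")
    print(f"rc = -16 is {errno_name(-16)}")
```

MDT0037 is index 55, i.e. the 56th MDT, which matches the observation
that the mount fails at 56 MDSes, and -16 is EBUSY.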
OS: CentOS 7.6.1810
Kernel: 3.10.0-957.5.1.el7.x86_64
Lustre: 2.12.1
Network card: Qlogic InfiniPath_QLE7340
Other things to note for completeness’ sake: this happened with both
ldiskfs and zfs backfstypes, and these tests were using files in
memory as the backing devices.
Is there something I’m missing as to why a filesystem with 56 or more
MDSes won’t mount?
Thanks,
Scott White
Scientist, HPC
Los Alamos National Laboratory
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud