The client was rebuilt locally from the source RPMs. I thought I had built it from the client source in the nightly build, but I can see now that it was the 2.7.0 source:

lustre-client-2.7.0-2.6.32_504.8.1.el6.x86_64.src.rpm

The client kernel is the OS-provided kernel.

At this point I have ripped out the entire 2.7.0-based install and rebuilt everything with the current 2.5.3 pre-built RPMs for the servers. The test client is RHEL 6.7, so I built the client locally against the current kernel. I can now mount the filesystem, at least.
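
For what it is worth, the client rebuild was just the stock source-RPM rebuild against the running kernel, roughly along these lines (a sketch; the exact SRPM file name, build paths, and prerequisite package list are from memory and may differ on your system):

[root@athena-head ~]# uname -r
[root@athena-head ~]# yum install kernel-devel-$(uname -r) rpm-build gcc make    # build prerequisites
[root@athena-head ~]# rpmbuild --rebuild lustre-client-2.5.3-*.src.rpm           # builds against the running kernel's headers
[root@athena-head ~]# yum localinstall ~/rpmbuild/RPMS/x86_64/lustre-client*.rpm
[root@athena-head ~]# modprobe lustre
[root@athena-head ~]# mount -t lustre 172.19.120.29@o2ib:/ltest /ltest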



On 12/04/2015 09:24 AM, jerome.be...@inserm.fr wrote:
Ok,

I am not using IB here, but it looks obvious that the max_frag value
differs between the MGS and the client.

Are you using the same Lustre version on the MGS/OSS AND the client, built
against the same kernel version? (i.e. lustre*-KERNEL_VERSION-LUSTRE_VERSION)
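
A quick way to compare, on each node, is something like this (the package globs are just examples; adjust to what you installed):

[root@lustre-mds ~]# uname -r
[root@lustre-mds ~]# rpm -qa 'lustre*' 'kernel*' | sort     # installed Lustre and kernel packages
[root@lustre-mds ~]# lctl get_param version                 # version reported by the loaded modules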

Did you try it with the latest nightly build?

If so, I will let the developers answer, or maybe you can open a bug.

Regards

On 04-12-2015 15:48, Ray Muno wrote:
As I mentioned, I am doing a test install to see what I want to run
for deployment. We have run a couple of Lustre installs, one 1.8.x based
and a current production one that is 2.3. The Lustre 2.3 server set
has been up for 750 days and has been very solid. This test replaces
the old 1.8 setup, and I need to come up with a consistent set of server
and clients that I can run on our clusters. The cluster (Rocks based)
will most likely get upgraded once we have a working set. I have a
set of compute nodes that will be set up to run either CentOS 6.6 or
6.7.

I started with 2.7 since that is what I got pointed to when I went to
the lustre.org download page. The "Most Recent Release" points me at
the 2.7.0 tree.  If I follow the path to download source on that page,

git clone git://git.hpdd.intel.com/fs/lustre-release.git

It is not even apparent from the downloaded tree which version I would
be building. The ChangeLog file mentions both 2.8 and 2.7. Everything on
the Lustre Download page seems to indicate I should be downloading
2.7.
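
The closest I could come to checking is the git tags, something like the following (the exact tag names are a guess on my part, and I am not sure which tag the default branch actually tracks):

cd lustre-release
git describe --tags        # show the nearest release tag to the checked-out commit
git tag -l '2.7*'          # list 2.7.x release tags
git checkout 2.7.0         # check out a specific release, if such a tag exists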

Since I started with a clean install of RHEL 6.6 on my server set, I
had the expectation that the pre-compiled server binaries would give
me a working set to test. That is when the frustration started. I
tried searching for clues by looking at the errors that I saw, but I did
not find much that duplicated what I was seeing. I just saw some odd
mentions about IB having issues in 2.6.32-504.8.1. This did not
directly correlate with my issues, but I figured I would try a later
kernel. That is why I pulled the nightly build off of
build.hpdd.intel.com and found I could at least establish a set of
servers that would talk to each other.

That is where I am now. I am trying to wrap my head around where my
issues lie. Is the problem specific to my Qlogic InfiniPath_QLE7240
cards? Is it the underlying OS-provided IB drivers? I guess I am
just really surprised that the distribution pointed to on the download
page fails out of the box on a set of servers with a clean install of
the specified OS. I just figured I must be doing something wrong
(which may still be the case).

At this point, it looks like I should back out 2.7 and rebuild
with the current 2.5 release.

Before I do that, however, I would like to gain some understanding as
to what I am seeing right now.  I have the server set built with 2.7.0
and the 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64 kernel on RHEL 6.6
(SL 6.6).


I rebuilt the 2.7.0 Lustre client on a RHEL (CentOS) 6.6 client, and I
could not mount the file system. It will mount my production Lustre
file system from another server set (2.3.0) without a problem. I
also tried with a RHEL 6.7 install, with the 2.7 Lustre client rebuilt
for the kernel (2.6.32-573.8.1.el6.x86_64). The client will not mount
the 2.7 Lustre file system, and I cannot even lctl ping the server
from the client.

On the client

[root@athena-head ~]# lctl ping  172.19.120.29@o2ib
failed to ping 172.19.120.29@o2ib: Input/output error

In dmesg

LNetError: 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected())
172.19.120.29@o2ib rejected: incompatible # of RDMA fragments 32, 256

On the Lustre MDS server.

Dec  3 18:14:08 lustre-mds kernel: LNet:
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32
wanted)

Trying to mount on the client

[root@athena-head ~]# uname -a
Linux athena-head.aem.umn.edu 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov
10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[root@athena-head ~]# mount -t lustre  172.19.120.29@o2ib:/ltest /ltest
mount.lustre: mount 172.19.120.29@o2ib:/ltest at /ltest failed:
Input/output error
Is the MGS running?

Dec  3 18:21:16 athena-head kernel: LNetError:
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
rejected: incompatible # of RDMA fragments 32, 256
Dec  3 18:21:16 athena-head kernel: Lustre:
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent
has failed due to network error: [sent 1449188476/real 1449188476]
req@ffff88002f810c80 x1519567173058612/t0(0)
o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0
to 1 dl 1449188481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec  3 18:21:41 athena-head kernel: LNetError:
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
rejected: incompatible # of RDMA fragments 32, 256
Dec  3 18:21:41 athena-head kernel: Lustre:
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent
has failed due to network error: [sent 1449188501/real 1449188501]
req@ffff88021e742c80 x1519567173058628/t0(0)
o250->MGC172.19.120.29@o2ib@172.19.120.29@o2ib:26/25 lens 400/544 e 0
to 1 dl 1449188511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec  3 18:21:53 athena-head kernel: LustreError: 15c-8:
MGC172.19.120.29@o2ib: The configuration from log 'ltest-client'
failed (-5). This may be the result of communication errors between
this node and the MGS, a bad configuration, or other errors. See the
syslog for more information.
Dec  3 18:21:53 athena-head kernel: Lustre: Unmounted ltest-client
Dec  3 18:21:53 athena-head kernel: LustreError:
7346:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount  (-5)

On the server

Dec  3 18:21:41 lustre-mds kernel: LNet:
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32
wanted)
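
The mismatch (32 versus 256) looks like the two sides are negotiating different numbers of RDMA fragments. One thing I plan to compare, though I am not at all sure it is the right knob, is the ko2iblnd module parameters on the client and the servers, for example:

[root@athena-head ~]# cat /sys/module/ko2iblnd/parameters/map_on_demand
[root@athena-head ~]# modinfo ko2iblnd | egrep -i 'map_on_demand|frag'

and, if they differ, pinning them to the same value on both sides via a modprobe option file such as /etc/modprobe.d/ko2iblnd.conf (the value 32 below is only an illustration):

options ko2iblnd map_on_demand=32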



On 12/04/2015 06:49 AM, jerome.be...@inserm.fr wrote:
Hi,


I honestly don't know if the compiled versions available here are meant
to be used by everyone, but they are publicly browsable on the Intel
Jenkins:

https://build.hpdd.intel.com

But since the source is publicly available from the Whamcloud git, in my
opinion there should not be any problem.

If you are in production, stick to 2.5.

Regards


On 04-12-2015 12:18, Jon Tegner wrote:
Hi,

Where do you find the 2.7.x releases? I thought fixes were only
released for the Intel maintenance version?

Regards,

/jon

On 12/04/2015 11:43 AM, jerome.be...@inserm.fr wrote:
Hello Ray,

One consideration first: you are trying the 2.7 version, which is not the
production one (that is 2.5). From this perspective, whether you run 2.7.0
or 2.7.x won't make any big difference; it is the development release.

Then, if I understand correctly, the problem comes from the InfiniBand driver
module, which is buggy in the 2.6.32-504.8.1 kernel, meaning that you
have to update the kernel to fix it. Doing this may mean that the
2.7.0 packages on the site, compiled against an older kernel version, will
then refuse to load (because kernel modules, i.e. the Lustre ones
here, rely on features that may change between kernel
versions, making them incompatible).

In any case, you can try to rebuild the 2.7.0 version from source
against your new kernel. The procedure is quite easy:

https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel

It will regenerate the 2.7.0 client on your newer kernel with the
working InfiniBand modules, but stability is not guaranteed, as the
2.7 branch is under development anyway.

Or use a precompiled one from the build site if you can't (some nasty
bugs in the base 2.x.0 versions are fixed in the latest builds).

The only constraint is to stick to the very same version on the MDS and OSS,
and at least the same or a newer version for the clients.

Regards

On 03-12-2015 16:13, Ray Muno wrote:
I am trying to set up a test deployment of Lustre 2.7.

I pulled RPMs from http://lustre.org/download/ and installed them on a
set of servers running Scientific Linux 6.6, which seems to be a proper
OS for deployment. Everything installs, and I can format the
filesystems on the MDS (1) and OSS (2) servers. When I try to mount
the OST file systems, I get communication errors. I can "lctl ping"
the servers from each other, but cannot establish communication
between the MDS and OSS.

The installation is on servers connected over Infiniband (Qlogic DDR
4X).

In trying to diagnose the issues related to the error messages, I
found mention in some list discussions that o2ib is broken in the
2.6.32-504.8.1 kernel.
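
For what it is worth, the servers are on the OS-provided IB stack; what I looked at to confirm which kernel and drivers were in play was roughly the following (ibstat and ibv_devinfo come from infiniband-diags and libibverbs-utils, if installed):

[root@lustre-mds ~]# uname -r
[root@lustre-mds ~]# lsmod | egrep 'ib_qib|ko2iblnd|rdma'    # Qlogic driver and LNet o2ib module
[root@lustre-mds ~]# ibstat | head                           # port state and firmware
[root@lustre-mds ~]# ibv_devinfo | head                      # verbs-level view of the HCA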

After much frustration, I pulled a nightly build from
build.hpdd.intel.com (kernel
2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64) and tried the same set up.
Everything worked as I expected.

Am I missing something? Is the default release pointed to at
https://downloads.hpdd.intel.com/ for 2.7 broken in some way? Is it
just the hardware I am trying to deploy against?

I can provide specifics about the errors I see; I am just posting
this to make sure I am pulling the Lustre RPMs from the proper
source.





--

 Ray Muno
 Computer Systems Administrator
 e-mail:   m...@aem.umn.edu
 Phone:   (612) 625-9531
 FAX:     (612) 626-1558

                          University of Minnesota
 Aerospace Engineering and Mechanics         Mechanical Engineering
 110 Union St. S.E.                          111 Church Street SE
 Minneapolis, MN 55455                       Minneapolis, MN 55455
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
