[lustre-discuss] Multi-cluster (multi-rail) setup

2015-06-12 Thread Thrash Er
New to Lustre O:)

I have to install and configure a Lustre storage for 4 small clusters
(4 different departments). Each cluster has its own IB QDR
interconnect for MPI (and now Lustre) and its own 1 GigE management
network. IB networks would be something like:
 Cluster A  192.168.1.0  o2ib0(ib0)
 Cluster B  192.168.2.0  o2ib1(ib1)
 Cluster C  192.168.3.0  o2ib2(ib2)
 Cluster D  192.168.4.0  o2ib3(ib3)

I've gone through the Lustre Operations Manual 2.x and, from what I
understood, I would have to:

1.- add 4 IB ports to each OSS and MDS/MGT and cable them like this:
 IB Port 0 - cluster A
 IB Port 1 - cluster B
 IB Port 2 - cluster C
 IB Port 3 - cluster D

2.- configure /etc/modprobe.d/lustre.conf on the OSS and MDS like this:

 options lnet networks=o2ib0(ib0),o2ib1(ib1),o2ib2(ib2),o2ib3(ib3)

3.- configure /etc/modprobe.d/lustre.conf on each node of each cluster
like this:

 Nodes in Cluster A:  options lnet networks=o2ib0(ib0)

 Nodes in Cluster B:  options lnet networks=o2ib1(ib1)

 Nodes in Cluster C:  options lnet networks=o2ib2(ib2)

 Nodes in Cluster D:  options lnet networks=o2ib3(ib3)
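
As a sanity check once the modules are loaded with those options, the
configured NIDs can be listed on each node (a minimal sketch; the host
addresses are just made-up examples within the ranges above, and lctl comes
with the Lustre tools):

 # on an OSS or the MDS: should show one NID per o2ib network
 modprobe lnet
 lctl network up
 lctl list_nids
 #   192.168.1.10@o2ib
 #   192.168.2.10@o2ib1
 #   192.168.3.10@o2ib2
 #   192.168.4.10@o2ib3

 # on a client in Cluster B: should show only its own network
 lctl list_nids
 #   192.168.2.21@o2ib1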


So, questions:
   1.- Are my assumptions correct?
   2.- No need for LNET routers, right?
   3.- Am I missing something?

Thanks !!
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Problem on some client that don't want to remount filesystem ( server on 2.5.3 )

2015-06-12 Thread Philippe Weill

hello

We have a problem on our Lustre 2.5.3 infrastructure,
a small cluster of 24 nodes.

The MDS rebooted on its own, and
now some clients refuse to remount the filesystem.

The clients are on 1.8.9wc1 since we're in a migration phase.

All clients could mount the 1.8 filesystem, but
6 clients can't remount the 2.5.3 filesystem.

Jun 12 09:52:52 ciclad11 kernel: Lustre: Server MGS version (2.5.3.0) is much 
newer than client version (1.8.9)
Jun 12 09:52:52 ciclad11 kernel: Lustre: MGC172.20.3.74@o2ib: Reactivating 
import
Jun 12 09:52:52 ciclad11 kernel: Lustre: MGC172.20.3.74@o2ib: Connection 
restored to service MGS using nid 172.20.3.74@o2ib.
Jun 12 09:52:52 ciclad11 kernel: Lustre: client 
etherfs-client(88040ef2f800) umount complete
Jun 12 09:52:52 ciclad11 kernel: LustreError: 
4754:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount  (-4)

Log from the MDS:

Jun 12 09:52:52 mds2-ipsl kernel: Lustre: MGS: Client 
e26b1313-d901-a410-7c8b-6c6148b6bd92 (at 172.20.3.243@o2ib) reconnecting
Jun 12 09:53:45 mds2-ipsl kernel: Lustre: MGS: Client 
a3cd5035-35d2-4f23-e337-73d0e7192047 (at 172.20.3.243@o2ib) reconnecting
Jun 12 09:54:38 mds2-ipsl kernel: Lustre: MGS: Client 
b7896a48-b23d-1651-4b3f-fa5c90cceab7 (at 172.20.3.243@o2ib) reconnecting
Jun 12 09:55:41 mds2-ipsl kernel: Lustre: MGS: haven't heard from client b25d7b77-22b5-c391-b883-7ae8f2044d09 (at 172.20.3.243@o2ib) 
in 228 seconds. I think it's dead, and I am evicting it. exp 8811c86af400, cur 1434095741 expire 1434095591 last 1434095513



I tried changing the client version on one of the non-working clients:

Jun 12 12:01:35 ciclad19 kernel: Lustre: 13789:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow 
reply: [sent 1434103288/real 1434103288]  req@88301856b800 x1503749299241316/t0(0) 
o503-MGC172.20.3.74@o2ib@172.20.3.74@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1434103295 ref 2 fl Rpc:X/0/ rc 0/-1
Jun 12 12:01:35 ciclad19 kernel: LustreError: 166-1: MGC172.20.3.74@o2ib: Connection to MGS (at 172.20.3.74@o2ib) was lost; in 
progress operations using this service will fail
Jun 12 12:01:41 ciclad19 kernel: Lustre: 3851:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow 
reply: [sent 1434103295/real 1434103295]  req@88381ba7fc00 x1503749299241320/t0(0) 
o250-MGC172.20.3.74@o2ib@172.20.3.74@o2ib:26/25 lens 400/544 e 0 to 1 dl 1434103301 ref 1 fl Rpc:XN/0/ rc 0/-1
Jun 12 12:01:48 ciclad19 kernel: LustreError: 15c-8: MGC172.20.3.74@o2ib: The configuration from log 'etherfs-client' failed (-5). 
This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog 
for more information.

Jun 12 12:01:48 ciclad19 kernel: LustreError: 
13789:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
Jun 12 12:01:48 ciclad19 kernel: Lustre: Unmounted etherfs-client
Jun 12 12:01:48 ciclad19 kernel: LustreError: 
13789:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount  (-5)

From the MDS:
Jun 12 12:01:35 mds2-ipsl kernel: Lustre: MGS: Client 
5c623fa9-1cae-6b75-5e15-acb8add53042 (at 172.20.3.235@o2ib) reconnecting
Jun 12 12:05:25 mds2-ipsl kernel: Lustre: MGS: haven't heard from client 5c623fa9-1cae-6b75-5e15-acb8add53042 (at 172.20.3.235@o2ib) 
in 230 seconds. I think it's dead, and I am evicting it. exp 88203ddda000, cur 1434103525 expire 1434103375 last 1434103295


Any ideas?
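
A first check from one of the affected clients would be basic LNet reachability
of the MGS (a minimal sketch, using the MGS NID from the logs above):

 # from an affected client, e.g. ciclad19
 lctl ping 172.20.3.74@o2ib      # LNet-level reachability of the MGS
 cat /proc/fs/lustre/version     # confirm which client version is running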


--
Weill Philippe -  Administrateur Systeme et Reseaux
CNRS/UPMC/IPSL   LATMOS (UMR 8190)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] trouble mounting after a tunefs

2015-06-12 Thread John White
Good Morning Folks,
We recently had to add TCP NIDs to an existing o2ib FS.  We added the 
nid to the modprobe.d stuff and tossed the definition of the NID in the 
failnode and mgsnode params on all OSTs and the MGS + MDT.  When either an o2ib
or tcp client tries to mount, the mount command hangs and dmesg repeats:
LustreError: 11-0: brc-MDT-mdc-881036879c00: Communicating with 
10.4.250.10@o2ib, operation mds_connect failed with -11.

I fear we may have overdone the parameters. Could anyone take a look here and
let me know if we need to fix things up (remove params, etc.)?

MGS:
Read previous values:
Target: MGS
Index:  unassigned
Lustre FS:  
Mount type: ldiskfs
Flags:  0x4
  (MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:

MDT:
 Read previous values:
Target: brc-MDT
Index:  0
Lustre FS:  brc
Mount type: ldiskfs
Flags:  0x1001
  (MDT no_primnode )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:  
mgsnode=10.4.250.11@o2ib,10.0.250.11@tcp:10.4.250.10@o2ib,10.0.250.10@tcp  
failover.node=10.4.250.10@o2ib,10.0.250.10@tcp:10.4.250.11@o2ib,10.0.250.11@tcp 
mdt.quota_type=ug

OST(sample):
Read previous values:
Target: brc-OST0002
Index:  2
Lustre FS:  brc
Mount type: ldiskfs
Flags:  0x1002
  (OST no_primnode )
Persistent mount opts: errors=remount-ro
Parameters:  
mgsnode=10.4.250.10@o2ib,10.0.250.10@tcp:10.4.250.11@o2ib,10.0.250.11@tcp  
failover.node=10.4.250.12@o2ib,10.0.250.12@tcp:10.4.250.13@o2ib,10.0.250.13@tcp 
ost.quota_type=ug
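
For reference, the LNet side of the change on the servers is just the networks
line in modprobe.d, which is of this general form (a sketch; the interface
names here are assumptions):

 # /etc/modprobe.d/lustre.conf on the servers
 # ib0/eth0 are assumed interface names; substitute whatever the servers use
 options lnet networks=o2ib(ib0),tcp(eth0)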
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] trouble mounting after a tunefs

2015-06-12 Thread Martin Hecht
Hi John,

On the Parameters line, the different nodes should not be separated by
':'. Each node should be specified by a separate mgsnode=... or
failover.node=... statement. I'm not sure whether separating the two
interfaces of each node with ',' is correct here, or whether this should be
split again into two separate statements.
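
Untested, but what I mean would look something like this for the OST in your
example (the device path is a placeholder, and whether the two NIDs of one
node stay comma-separated inside a single option is exactly the part I'm not
sure about; a writeconf may also be needed after changing NIDs, see the
manual):

 tunefs.lustre --erase-params \
     --mgsnode=10.4.250.10@o2ib,10.0.250.10@tcp \
     --mgsnode=10.4.250.11@o2ib,10.0.250.11@tcp \
     --failnode=10.4.250.12@o2ib,10.0.250.12@tcp \
     --failnode=10.4.250.13@o2ib,10.0.250.13@tcp \
     --param ost.quota_type=ug \
     /dev/<ost_device>    # placeholder, use the real OST device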

best regards,
Martin

On 06/12/2015 05:07 PM, John White wrote:
 Good Morning Folks,
   We recently had to add TCP NIDs to an existing o2ib FS.  We added the 
 nid to the modprobe.d stuff and tossed the definition of the NID in the 
 failnode and mgsnode params on all OSTs and the MGS + MDT.  When either an 
 o2ib or tcp client tries to mount, the mount command hangs and dmesg repeats:
 LustreError: 11-0: brc-MDT-mdc-881036879c00: Communicating with 
 10.4.250.10@o2ib, operation mds_connect failed with -11.

 I fear we may have overdone the parameters. Could anyone take a look here
 and let me know if we need to fix things up (remove params, etc.)?

 MGS:
 Read previous values:
 Target: MGS
 Index:  unassigned
 Lustre FS:  
 Mount type: ldiskfs
 Flags:  0x4
   (MGS )
 Persistent mount opts: user_xattr,errors=remount-ro
 Parameters:

 MDT:
  Read previous values:
 Target: brc-MDT
 Index:  0
 Lustre FS:  brc
 Mount type: ldiskfs
 Flags:  0x1001
   (MDT no_primnode )
 Persistent mount opts: user_xattr,errors=remount-ro
 Parameters:  
 mgsnode=10.4.250.11@o2ib,10.0.250.11@tcp:10.4.250.10@o2ib,10.0.250.10@tcp  
 failover.node=10.4.250.10@o2ib,10.0.250.10@tcp:10.4.250.11@o2ib,10.0.250.11@tcp
  mdt.quota_type=ug

 OST(sample):
 Read previous values:
 Target: brc-OST0002
 Index:  2
 Lustre FS:  brc
 Mount type: ldiskfs
 Flags:  0x1002
   (OST no_primnode )
 Persistent mount opts: errors=remount-ro
 Parameters:  
 mgsnode=10.4.250.10@o2ib,10.0.250.10@tcp:10.4.250.11@o2ib,10.0.250.11@tcp  
 failover.node=10.4.250.12@o2ib,10.0.250.12@tcp:10.4.250.13@o2ib,10.0.250.13@tcp
  ost.quota_type=ug
 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Multi-cluster (multi-rail) setup

2015-06-12 Thread Chris Horn
Hello and welcome to Lustre :)

 3.- configure /etc/modprobe.d/lustre.conf on each node of each cluster
 like this:
 
 Nodes in Cluster A:  options lnet networks=o2ib0(ib0)

 Nodes in Cluster B:  options lnet networks=o2ib1(ib1)

 Nodes in Cluster C:  options lnet networks=o2ib2(ib2)

 Nodes in Cluster D:  options lnet networks=o2ib3(ib3)

The “(ibX)” portion of that string should correspond to the local IB interface
that the clients in those clusters are actually using, i.e., which port on the
clients is active, not the port that is used by the servers on that LNet. My guess
is that the clients have a single IB HCA with a cable plugged into port 0, so
what you probably want is:

Nodes in Cluster A:  options lnet networks=o2ib0(ib0)

Nodes in Cluster B:  options lnet networks=o2ib1(ib0)

Nodes in Cluster C:  options lnet networks=o2ib2(ib0)

Nodes in Cluster D:  options lnet networks=o2ib3(ib0)

Again, that’s just a guess at how these things are typically configured. You’ll
want to check whether that is actually the case for your clusters.
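
One quick way to confirm which interface is actually active on a client before
writing the lustre.conf line (a minimal sketch; ib0 and the Cluster B addressing
are assumptions):

 # on a client in Cluster B, for example
 ip addr show ib0     # should carry the 192.168.2.x address
 ibstat               # shows which HCA port has an active link

 # then the matching line in /etc/modprobe.d/lustre.conf:
 options lnet networks=o2ib1(ib0)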

Chris Horn

 On Jun 12, 2015, at 2:37 AM, Thrash Er mingorrubi...@gmail.com wrote:
 
 New to Lustre O:)
 
 I have to install and configure a Lustre storage for 4 small clusters
 (4 different departments). Each cluster has its own IB QDR
 interconnect for MPI (and now Lustre) and its own 1 GigE management
 network. IB networks would be something like:
 Cluster A  192.168.1.0  o2ib0(ib0)
 Cluster B  192.168.2.0  o2ib1(ib1)
 Cluster C  192.168.3.0  o2ib2(ib2)
 Cluster D  192.168.4.0  o2ib3(ib3)
 
 I've gone through the Lustre Operations Manual 2.x and, from what I
 understood, I would have to:
 
 1.- add 4 IB ports to each OSS and MDS/MGT and cable them like this:
 IB Port 0 - cluster A
 IB Port 1 - cluster B
 IB Port 2 - cluster C
 IB Port 3 - cluster D
 
 2.- configure /etc/modprobe.d/lustre.conf on the OSS and MDS like this:
 
 options lnet networks=o2ib0(ib0),o2ib1(ib1),o2ib2(ib2),o2ib3(ib3)
 
 3.- configure /etc/modprobe.d/lustre.conf on each node of each cluster
 like this:
 
 Nodes in Cluster A:  options lnet networks=o2ib0(ib0)

 Nodes in Cluster B:  options lnet networks=o2ib1(ib1)

 Nodes in Cluster C:  options lnet networks=o2ib2(ib2)

 Nodes in Cluster D:  options lnet networks=o2ib3(ib3)
 
 
 So, questions:
   1.- Are my assumptions correct?
   2.- No need for LNET routers, right?
   3.- Am I missing something?
 
 Thanks !!
 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org