Hi Everyone,

I never received a reply or any suggestions on this one, and we are still 
having the issue.  To summarize: the clients get the wrong address for the MDS 
when our LMD01 node is running the service.  If LMD02 (the active/passive HA 
partner to LMD01) runs as the MDS, things work.

Some further information that may be helpful showing the 'tunefs.lustre 
--print' details of the MDT:

r...@lmd01 ~# tunefs.lustre --mdt --print /dev/sdd
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

    Read previous values:
Target:     umt3-MDT0000
Index:      0
Lustre FS:  umt3
Mount type: ldiskfs
Flags:      0x1
               (MDT )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: 
mgsnode=10.10.1....@tcp,192.41.230....@tcp1,141.211.101....@tcp2
failover.node=10.10.1...@tcp,192.41.230...@tcp1


    Permanent disk data:
Target:     umt3-MDT0000
Index:      0
Lustre FS:  umt3
Mount type: ldiskfs
Flags:      0x1
               (MDT )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: 
mgsnode=10.10.1....@tcp,192.41.230....@tcp1,141.211.101....@tcp2 
failover.node=10.10.1...@tcp,192.41.230...@tcp1

Notice there is no reference to 192.41.230...@tcp anywhere here.   
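
For anyone who wants to repeat that check mechanically, here is a minimal 
Python sketch. It splits the 'Parameters:' value printed above into keys and 
NIDs and looks for a 192.41.230.x address on the 'tcp' network. The helper 
names (parse_parameters, normalize_net, nids_on_network) are hypothetical, the 
elided NIDs are copied exactly as they appear in the output, and treating 
'tcp' as 'tcp0' is an assumption made for the comparison.

```python
def parse_parameters(param_text):
    """Split a 'Parameters:' value into {key: [NID, ...]}."""
    params = {}
    for token in param_text.split():
        key, _, value = token.partition("=")
        # A key may appear more than once, so accumulate its NIDs.
        params.setdefault(key, []).extend(value.split(","))
    return params


def normalize_net(net):
    """Assumption: a network name without a number ('tcp') means 'tcp0'."""
    return net if net[-1].isdigit() else net + "0"


def nids_on_network(params, key, net):
    """Return the NIDs stored under `key` that sit on LNET network `net`."""
    wanted = normalize_net(net)
    return [nid for nid in params.get(key, [])
            if normalize_net(nid.rsplit("@", 1)[-1]) == wanted]


# Parameters value as printed by tunefs.lustre above (NIDs left elided).
sample = ("mgsnode=10.10.1....@tcp,192.41.230....@tcp1,141.211.101....@tcp2 "
          "failover.node=10.10.1...@tcp,192.41.230...@tcp1")

params = parse_parameters(sample)
public_on_tcp = [nid
                 for key in params
                 for nid in nids_on_network(params, key, "tcp")
                 if nid.startswith("192.41.230.")]
print(public_on_tcp)  # empty list: no public address is stored on 'tcp'
```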

Thanks for any suggestions,

Shawn

-----Original Message-----
From: McKee, Shawn 
Sent: Monday, May 10, 2010 2:28 PM
To: 'lustre-discuss@lists.lustre.org'
Cc: aglt2-ad...@umich.edu
Subject: Clients getting incorrect network information for one of two MDT 
servers (active/passive)

Hi Everyone,

We are having a problem with Lustre v1.8.3/x86_64 (ext4 flavor if it matters).  
We are very new to using Lustre so our problem may be trivial to those with 
experience. 

We have set up a separate MGS server and we have an HA setup for our MDT.  There 
are two servers with a backend iSCSI storage area for the MDT on 
lmd01.aglt2.org/lmd02.aglt2.org (active/passive using RedHat clustering).  All 
nodes are dual-homed (private and public networks).  Failover works without a 
problem, apart from the issue we are asking about.

The primary problem is that one of the MDT nodes (LMD01) seems to be 
unreachable from the clients.  We have configured lnet to use the private 
network to mount/access Lustre.  The lnet line in /etc/modprobe.conf looks like 
this on an MDT server:

options lnet networks=tcp0(bond0.4010) routes="tcp2 10.10.1.[50-...@tcp0"

(We also have some routing for an external public network to allow clients 
there to mount.  I am not sure it is relevant to our problem, but I can provide 
details if that is useful.)

The 'bond0.4010' is the private network.  The clients on this  private network 
look similar:

options lnet networks=tcp0(eth0)

The relevant IPs:   lmd01 has 10.10.1.48 (private) and 192.41.230.48 (public)
                    lmd02 has 10.10.1.49 (private) and 192.41.230.49 (public)

The problem we have is shown in the 'lctl --net tcp0 peer_list' output:

[r...@bl-11-1 ~]# lctl --net tcp0  peer_list
12345-10.10.1...@tcp [1]bl-11-1.local->umfs06.local:988 #3
12345-10.10.1...@tcp [1]bl-11-1.local->umfs16.local:988 #3
12345-10.10.1....@tcp [1]bl-11-1.local->mgs.local:988 #3
12345-10.10.1...@tcp [2]bl-11-1.local->lmd02.local:988 #6
12345-192.41.230...@tcp [1116]0.0.0.0->lmd01.aglt2.org:988 #0
12345-10.10.1...@tcp [1]bl-11-1.local->umfs05.local:988 #3
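
A peer list like this can also be checked by script. The Python sketch below 
flags any NID that sits on the private LNET network but whose address is 
outside the private range. It is only an illustration: the function names are 
hypothetical, the /24 mask on 10.10.1.0 is an assumption (the mask is not 
stated in this message), and the two sample NIDs are built from the lmd02 and 
lmd01 addresses listed above rather than copied from the peer list.

```python
import ipaddress

# Assumption: the private network is 10.10.1.0/24.
PRIVATE_NET = ipaddress.ip_network("10.10.1.0/24")


def normalize_net(net):
    """Assumption: a network name without a number ('tcp') means 'tcp0'."""
    return net if net[-1].isdigit() else net + "0"


def misplaced_peers(nids, lnet="tcp0", subnet=PRIVATE_NET):
    """Return NIDs on `lnet` whose IP address is not inside `subnet`."""
    bad = []
    for nid in nids:
        addr, _, net = nid.rpartition("@")
        if normalize_net(net) != lnet:
            continue  # a different LNET network, not our concern here
        if ipaddress.ip_address(addr) not in subnet:
            bad.append(nid)
    return bad


# lmd02 private address and lmd01 public address, both on 'tcp'.
sample_nids = ["10.10.1.49@tcp", "192.41.230.48@tcp"]
flagged = misplaced_peers(sample_nids)
print(flagged)  # ['192.41.230.48@tcp']
```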

Notice the "public" address 192.41.230.48 showing up on the 'tcp' ('tcp0') 
network?   This seems to be the problem.  If LMD01 takes over actively serving 
the MDT we see things like the following in the logs:

2010-05-10T12:21:01-04:00 lmd01.aglt2.org kernel: [272846.750287] LustreError: 120-3: Refusing connection from 192.41.237.235 for 192.41.230...@tcp: No matching NI
2010-05-10T12:23:46-04:00 lmd01.aglt2.org kernel: [273011.595403] LustreError: 120-3: Refusing connection from 192.41.237.235 for 192.41.230...@tcp: No matching NI
2010-05-10T12:29:01-04:00 lmd01.aglt2.org kernel: [273326.290186] LustreError: 120-3: Refusing connection from 192.41.230.203 for 192.41.230...@tcp: No matching NI
2010-05-10T12:48:11-04:00 lmd01.aglt2.org kernel: [274475.351001] LustreError: 120-3: Refusing connection from 192.41.230.168 for 192.41.230...@tcp: No matching NI

This makes sense because LMD01 is NOT supposed to be using its public IP for 
Lustre.  The strange thing is that LMD02 (set up almost exactly the same way as 
LMD01) doesn't have this problem and always works fine on the private network.  
Deleting the "bad" peer address on the client doesn't help, since it just 
re-appears as soon as the client tries to access Lustre.  Any ideas about what 
could be providing this "bad" IP and how we can remove it?
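
To see which clients are affected, the "No matching NI" messages can be 
tallied per source address. The Python sketch below is a rough illustration 
only: the regular expression is derived from the four log lines quoted above 
and not from any Lustre specification, the function name is hypothetical, and 
the sample messages keep the elided target NID exactly as it appears in the 
logs.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted above (an assumption).
REFUSED = re.compile(
    r"Refusing connection from (\S+) for (\S+): No matching NI")


def refused_clients(log_text):
    """Count refused connections per client address.

    Whitespace is collapsed first so that messages wrapped across
    several lines (as in an email) still match.
    """
    joined = " ".join(log_text.split())
    return Counter(m.group(1) for m in REFUSED.finditer(joined))


sample_log = """
LustreError: 120-3: Refusing connection from 192.41.237.235 for
192.41.230...@tcp: No matching NI
LustreError: 120-3: Refusing connection from 192.41.237.235 for
192.41.230...@tcp: No matching NI
LustreError: 120-3: Refusing connection from 192.41.230.203 for
192.41.230...@tcp: No matching NI
LustreError: 120-3: Refusing connection from 192.41.230.168 for
192.41.230...@tcp: No matching NI
"""

counts = refused_clients(sample_log)
for client, n in counts.most_common():
    print(client, n)
```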

FYI, I even tried "adding" tcp1 (for the public NIC) to the lnet options on 
LMD01/LMD02 but clients still fail since the request is coming in as 
'192.41.230...@tcp'  and not as '192.41.230...@tcp1'.

Thanks for any help or pointers to what might be wrong.

Shawn McKee/University of Michigan Physics


