[lustre-discuss] lctl replace_nids example

2016-07-21 Thread Patricia Santos Marco
Hello,


Recently a new gigabit network was added to our Lustre servers, and we
want to add this network to the Lustre file system.

In Lustre 2.4 there is an "lctl replace_nids" command that, in theory,
allows changing the NIDs without running --writeconf. However, I can't
find any example of how to use this command. Has anybody used it for
this purpose?
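
For reference, the Lustre Operations Manual gives the syntax as "lctl replace_nids devicename nid1[,nid2,nid3 ...]", run on the MGS with clients and the other targets unmounted. The following is only a minimal sketch that builds the command line for review instead of running it. The helper name build_replace_nids is hypothetical, and the target name LUSTRE-MDT0000 is an assumption inferred from the 'LUSTRE-client' log name in the error further down.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an "lctl replace_nids" command line.
# Usage: build_replace_nids TARGET NID [NID...]
build_replace_nids() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_replace_nids TARGET NID [NID...]" >&2
        return 1
    fi
    target=$1
    shift
    # Join the remaining arguments with commas: nid1,nid2,...
    nids=$(IFS=,; echo "$*")
    echo "lctl replace_nids $target $nids"
}

# Example (the target name is an assumption, not taken from the thread):
#   build_replace_nids LUSTRE-MDT0000 192.168.2.252@o2ib 192.168.3.252@tcp
```

Printing the command first makes it possible to check the target name and the NID list against "lctl list_nids" before anything is changed on the MGS.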

For more info:

We added the new interface to the LNet configuration (tcp0):
/etc/modprobe.d/lustre_interface.conf
options lnet networks=o2ib0(ib0),tcp0(eth2)

and we see both interfaces:
 lnetctl net show
net:
- net: lo
  nid: 0@lo
  status: up
- net: o2ib
  nid: 192.168.2.252@o2ib
  status: up
  interfaces:
  0: ib0
- net: tcp
  nid: 192.168.3.252@tcp
  status: up
  interfaces:
  0: eth2

[root@cmds ~]# lctl list_nids
192.168.2.252@o2ib
192.168.3.252@tcp

However, new clients aren't able to connect to the Lustre servers via the
new interface.

 LustreError: 15c-8: MGC192.168.3.252@tcp: The configuration from log
'LUSTRE-client' failed (-2). This may be the result of communication errors
between this node and the MGS, a bad configuration, or other errors. See
the syslog for more information.

Thanks!

Patricia Santos Marco

HPC research group System Administrator

Instituto de Biocomputación y Física de Sistemas Complejos (BIFI)

Universidad de Zaragoza

e-mail: psan...@bifi.es 

phone: (+34) 976762992

http://bifi.es/~patricia/
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] luster client mount issues

2016-07-21 Thread Martin Hecht
Hi,

I think your client doesn't have the o2ib lnet (it should appear in the
output of lctl ping, even if you ping over the tcp lnet).
In your /etc/modprobe.d/lustre.conf, o2ib is associated with the ib0
interface, but your /var/log/messages mentions ib1.
If it is a dual-port card with just one port in use, the easiest fix would
be to plug the cable into the other port. (If there are two IB
connections, things might become a bit more complicated. There are
examples of multi-rail configurations using several lnets in the Lustre
manual, but maybe this goes too far.)
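
To compare the interface named in the module options with the one in the kernel log, a small sketch like the following could list each network with its bound interface. It assumes one interface per network and an unquoted "options lnet networks=..." line. The helper name lnet_net_ifaces and the sample line are hypothetical, since the actual lustre.conf is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read modprobe config lines on stdin and print
# "network interface" pairs from an "options lnet networks=..." line.
# Assumes one interface per network and no quoting around the value.
lnet_net_ifaces() {
    sed -n 's/^options[[:space:]]\{1,\}lnet[[:space:]].*networks=\([^[:space:]]*\).*/\1/p' |
        tr ',' '\n' |
        sed -n 's/^\([^()]*\)(\([^()]*\))$/\1 \2/p'
}

# Example:
#   lnet_net_ifaces < /etc/modprobe.d/lustre.conf
#   echo 'options lnet networks=o2ib0(ib0),tcp0(eth0)' | lnet_net_ifaces
```

If the interface printed for o2ib0 differs from the one that shows up in /var/log/messages, that mismatch is the first thing to resolve.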

As for the attempt to mount via tcp (or tcp0, which is the same), I think
the problem is that the file system config on the MGS doesn't contain
the tcp NIDs and/or the routes are not configured correctly. It seems
the attempt to mount via tcp causes the client to use o2ib for the
connections to the MDS and OSSes. So I would recommend getting o2ib
working first and then looking at tcp0 at a later stage (if you need it at
all; native o2ib performs better).
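
One way to see which networks are involved is to reduce the NID lists to their network names and compare the server's list with the client's. A minimal sketch, assuming input in the form printed by "lctl ping" or "lctl list_nids" (one NID per line); the helper name nids_to_nets is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read NIDs on stdin (for example the output of
# "lctl ping <nid>" or "lctl list_nids") and print the distinct LNet
# network names, ignoring the loopback network "lo".
nids_to_nets() {
    sed -n 's/.*@\([^@[:space:]]*\)[[:space:]]*$/\1/p' |
        grep -v '^lo$' |
        sort -u
}

# Example, run on the client:
#   lctl ping 192.168.111.52@tcp | nids_to_nets   # networks the server advertises
#   lctl list_nids | nids_to_nets                 # networks the client itself has
```

A network that appears in the first list but not in the second is one the server advertises and the client cannot reach directly.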

Last but not least, I noticed a typo in your client mount command:
mount -t lustre 192.168.200.52@ob2:/mylustre /lustre
The network name should be "o2ib" here, too.
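
A typo like this can be caught before mounting with a small check on the network part of the NID. This is only a sketch: the helper name check_mount_nid is hypothetical, it assumes a single MGS NID in the mount source, and it recognises only the tcp and o2ib network types used in this thread (LNet supports others).

```shell
#!/bin/sh
# Hypothetical helper: check that the NID in a "NID:/fsname" mount source
# names a known LNet network type. Only tcp and o2ib (with an optional
# network number, e.g. tcp0) are accepted here, which is an assumption.
check_mount_nid() {
    nid=${1%%:/*}      # strip ":/fsname"
    net=${nid##*@}     # part after the last "@"
    base=$(printf '%s' "$net" | sed 's/[0-9]*$//')   # drop network number
    case $base in
        tcp|o2ib) return 0 ;;
        *) echo "unrecognised LNet network '$net' in '$1'" >&2
           return 1 ;;
    esac
}

# Example:
#   check_mount_nid 192.168.200.52@o2ib:/mylustre && \
#       mount -t lustre 192.168.200.52@o2ib:/mylustre /lustre
```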

best regards,
Martin

On 07/20/2016 08:09 PM, sohamm wrote:
> Hi
>
> Any guidance/help on this is greatly appreciated.
>
> Thanks
>
> On Mon, Jul 18, 2016 at 7:25 PM, sohamm wrote:
>
>> Hi Ben
>> Both the networks have netmasks of value 255.255.255.0
>>
>> Thanks
>>
>> On Mon, Jul 18, 2016 at 10:08 AM, Ben Evans wrote:
>>
>>> What do your netmasks look like on each network?
>>>
>>> From: lustre-discuss on behalf of sohamm
>>> Date: Monday, July 18, 2016 at 1:56 AM
>>> To: "lustre-discuss@lists.lustre.org" 
>>> Subject: Re: [lustre-discuss] lustre-discuss Digest, Vol 124, Issue 17
>>>
>>> Hi Thomas
>>> Below are the results of the commands you suggested.
>>>
>>> *From Client*
>>> [root@dev1 ~]# lctl ping 192.168.200.52@o2ib
>>> failed to ping 192.168.200.52@o2ib: Input/output error
>>> [root@dev1 ~]# lctl ping 192.168.111.52@tcp
>>> 12345-0@lo
>>> 12345-192.168.200.52@o2ib
>>> 12345-192.168.111.52@tcp
>>> [root@dev1 ~]# mount -t lustre 192.168.111.52@tcp:/mylustre /lustre
>>> mount.lustre: mount 192.168.111.52@tcp:/mylustre at /lustre failed:
>>> Input/output error
>>> Is the MGS running?
>>> mount: mounting 192.168.111.52@tcp:/mylustre on /lustre failed: Invalid
>>> argument
>>>
>>> cat /var/log/messages | tail
>>> Jul 18 01:37:04 dev1 user.warn kernel: [2250504.401397] ib1: multicast
>>> join failed for ff12:401b::::::, status -22
>>> Jul 18 01:37:26 dev1 user.warn kernel: [2250526.257309] LNet: No route to
>>> 12345-192.168.200.52@o2ib via  (all routers down)
>>> Jul 18 01:37:36 dev1 user.warn kernel: [2250536.481862] ib1: multicast
>>> join failed for ff12:401b::::::, status -22
>>> Jul 18 01:41:53 dev1 user.warn kernel: [2250792.947299] LNet: No route to
>>> 12345-192.168.200.52@o2ib via  (all routers down)
>>>
>>>
>>> *From MGS*
>>> [root@lustre_mgs01_vm03 ~]# lctl ping 192.168.111.102@tcp
>>> 12345-0@lo
>>> 12345-192.168.111.102@tcp
>>>
>>> Please let me know what else I can try. It looks like I am missing something
>>> in the IB config. Do I need a router set up as part of LNet?
>>> If I am able to ping the MGS from the client on the tcp network, shouldn't
>>> it still work?
>>>
>>> Thanks
>>>
>>>
>>> On Sun, Jul 17, 2016 at 1:07 PM, 
 To: "lustre-discuss@lists.lustre.org"
 
 Subject: [lustre-discuss] llapi_file_get_stripe() and
 /proc/fs/lustre/osc/entries
 Message-ID: <03ceaaa0-b004-ae43-eaa1-437da2a5b...@iodoctors.com>
 Content-Type: text/plain; charset="utf-8"; Format="flowed"

 I am using