All of the servers and clients mentioned above use the netmask
255.255.255.224, and they can ping each other. For example:

root@ml-gpu-ser200.nmg01:~$ ping node28
PING node28 (10.82.143.202) 56(84) bytes of data.
64 bytes from node28 (10.82.143.202): icmp_seq=1 ttl=61 time=0.047 ms
64 bytes from node28 (10.82.143.202): icmp_seq=2 ttl=61 time=0.028 ms

--- node28 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.028/0.037/0.047/0.011 ms
root@ml-gpu-ser200.nmg01:~$ lctl ping node28@o2ib1
failed to ping 10.82.143.202@o2ib1: Input/output error
root@ml-gpu-ser200.nmg01:~$
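
For readers following the thread, here is a minimal sketch of the subnet arithmetic behind the netmask question. It uses Python's standard `ipaddress` module to check whether two addresses fall in the same IP subnet under a given netmask. The server address is the one shown in the ping output above. The client address is hypothetical: the thread only says clients use 10.82.141.X, so `10.82.141.10` is an assumed value for illustration, and the helper names are made up for this sketch.

```python
import ipaddress

SERVER_IP = "10.82.143.202"  # node28, taken from the ping output above
CLIENT_IP = "10.82.141.10"   # hypothetical: only "10.82.141.X" is known from the thread


def network_of(ip, netmask):
    """Return the IP network that `ip` belongs to under `netmask`."""
    return ipaddress.ip_interface(f"{ip}/{netmask}").network


def same_subnet(ip_a, ip_b, netmask):
    """True if both addresses land in the same network under `netmask`."""
    return network_of(ip_a, netmask) == network_of(ip_b, netmask)


if __name__ == "__main__":
    # Compare the netmask in use (255.255.255.224) with the two discussed in the thread.
    for mask in ("255.255.255.224", "255.255.255.0", "255.255.0.0"):
        print(
            mask,
            network_of(SERVER_IP, mask),
            network_of(CLIENT_IP, mask),
            "same subnet" if same_subnet(SERVER_IP, CLIENT_IP, mask) else "different subnets",
        )
```

Under these assumptions, the two addresses share a subnet only with the 255.255.0.0 mask. With 255.255.255.224 the server sits in 10.82.143.192/27, which cannot contain any 10.82.141.X address.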

We also have hundreds of GPU machines on different IP subnets. They
are in service, and it is difficult to change the network structure. Is there
any material or documentation that can guide me in solving this without
changing the network structure?

Thanks
Yu

Mohr Jr, Richard Frank (Rick Mohr) <rm...@utk.edu> wrote on Fri, Jun 29, 2018 at 3:30 AM:

>
> > On Jun 27, 2018, at 4:44 PM, Mohr Jr, Richard Frank (Rick Mohr) <rm...@utk.edu> wrote:
> >
> >
> >> On Jun 27, 2018, at 3:12 AM, yu sun <sunyu1...@gmail.com> wrote:
> >>
> >> client:
> >> root@ml-gpu-ser200.nmg01:~$ mount -t lustre node28@o2ib1:node29@o2ib1:/project /mnt/lustre_data
> >> mount.lustre: mount node28@o2ib1:node29@o2ib1:/project at /mnt/lustre_data failed: Input/output error
> >> Is the MGS running?
> >> root@ml-gpu-ser200.nmg01:~$ lctl ping node28@o2ib1
> >> failed to ping 10.82.143.202@o2ib1: Input/output error
> >> root@ml-gpu-ser200.nmg01:~$
> >
> > In your previous email, you said that you could mount lustre on the
> > client ml-gpu-ser200.nmg01.  Was that not accurate, or did something change
> > in the meantime?
>
> (Note: Received out-of-band reply from Yu stating that there was a typo in
> the previous email, and that client ml-gpu-ser200.nmg01 could not mount
> lustre.  Continuing discussion here so others on list can follow/benefit.)
>
> Yu,
>
> For the IPoIB addresses used on your nodes, what are the subnets (and
> netmasks) that you are using?  It looks like servers use 10.82.143.X and
> clients use 10.82.141.X.  If you are using a 255.255.0.0 netmask, you
> should be fine.  But if you are using 255.255.255.0, then you will run into
> problems.  Lustre expects that all nodes on the same lnet network (o2ib1 in
> your case) will also be on the same IP subnet.
>
> Have you tried running a regular “ping <IPoIB_address>” command between
> clients and servers to make sure that part is working?
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
