[lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Faaland, Olaf P.
Hi,

I've got a cluster running 2.11 with 2 routers and 68 compute nodes.  It's the 
first time I've used a post-multi-rail version of Lustre.

The problem I'm trying to troubleshoot is that my sample compute node (ulna66) 
seems to think the router I configured (ulna4) is down, so an attempt to ping 
outside the cluster fails with "no route to XXX" on the console.  I can lctl 
ping the router from the compute node and vice versa.  Forwarding is enabled 
on the router node via a modprobe argument.

lnetctl route show reports that the route is down.  Where I'm stuck is figuring 
out what in userspace (e.g. lnetctl or lctl) can tell me why.
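
A sketch of the relevant checks, using the addresses above (per the replies 
below, "lnetctl routing show" on the router is the one that matters here):

# on the compute node: verify the gateway answers at the LNet level
lctl ping 192.168.128.4@o2ib33

# on the router node: verify forwarding is enabled (look for "enable: 1")
lnetctl routing show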

The compute node's lnet configuration is:

[root@ulna66:lustre-211]# cat /etc/lnet.conf
ip2nets:
  - net-spec: o2ib33
    interfaces:
        0: hsi0
    ip-range:
        0: 192.168.128.*
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
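
If anyone wants to reproduce this state by hand, the file can also be loaded 
without the systemd unit.  A sketch, assuming stock 2.11 tools:

# configure LNet directly from the YAML above
modprobe lnet
lnetctl lnet configure
lnetctl import /etc/lnet.conf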

After I start lnet, systemctl reports success and the state is as follows:

[root@ulna66:lustre-211]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib33
      local NI(s):
        - nid: 192.168.128.66@o2ib33
          status: up
          interfaces:
              0: hsi0

[root@ulna66:lustre-211]# lnetctl peer show --verbose
peer:
    - primary nid: 192.168.128.4@o2ib33
      Multi-Rail: False
      peer ni:
        - nid: 192.168.128.4@o2ib33
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 7
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 4
          statistics:
              send_count: 2
              recv_count: 2
              drop_count: 0

[root@ulna66:lustre-211]# lnetctl route show --verbose
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: down

I can instrument the code, but I figure there must be someplace a normal user 
can look that I'm unaware of.

thanks,

Olaf P. Faaland
Livermore Computing


Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Faaland, Olaf P.
Update:

Joe pointed out "lnetctl set routing 1".  After invoking that on the router 
node, the compute node reports the route as up:

[root@ulna66:lustre-211]# lnetctl route show -v
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: up

Does this replace the lnet module parameter "forwarding"?

Olaf P. Faaland
Livermore Computing





Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Alexander I Kulyavtsev
To the original question: lnetctl on the router node shows ‘enable: 1’:

# lnetctl routing show
routing:
    - cpt[0]:
      …snip…
    - enable: 1

Lustre 2.10.3-1.el6

Alex.



Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Faaland, Olaf P.
So the problem was indeed that "routing" was disabled on the router node.  I 
added "routing: 1" to the lnet.conf file for the routers and lctl ping works 
as expected.
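
For reference, a sketch of the routers' /etc/lnet.conf with that key added. 
The o2ib100 interface name below is hypothetical, since I haven't shown the 
routers' actual net config here; a router forwarding between o2ib33 and 
o2ib100 needs an NI on both nets:

ip2nets:
  - net-spec: o2ib33
    interfaces:
        0: hsi0
  - net-spec: o2ib100
    interfaces:
        0: hsi1    # hypothetical second fabric interface
routing: 1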

The question about the lnet module option "forwarding" still stands.  The 
lnet module still accepts a parameter, "forwarding", but it doesn't do what 
it used to.  Is that just a leftover that needs to be cleaned up?

thanks,

Olaf P. Faaland
Livermore Computing




Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-18 Thread Dilger, Andreas
On Apr 17, 2018, at 19:00, Faaland, Olaf P. wrote:
> 
> So the problem was indeed that "routing" was disabled on the router node.  I 
> added "routing: 1" to the lnet.conf file for the routers and lctl ping works 
> as expected.
> 
> The question about the lnet module option "forwarding" still stands.  The 
> lnet module still accepts a parameter, "forwarding", but it doesn't do what 
> it used to.  Is that just a leftover that needs to be cleaned up?

I would say that the module parameter should continue to work, and be 
equivalent to the "routing: 1" YAML parameter.  This facilitates upgrades.
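
For context, the legacy mechanism in question is the module-parameter form 
documented in the Lustre manual for marking a node as a router:

# /etc/modprobe.d/lustre.conf on the router node
options lnet forwarding="enabled"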

Did you try this with 2.10 (which also has LNet Multi-Rail), or are you coming 
from 2.7 or 2.8?

I'd recommend filing a ticket in Jira for this.  I suspect it might also be 
broken in 2.10, and the fix should be backported there as well.

Cheers, Andreas

--
Andreas Dilger
Lustre Principal Architect
Intel Corporation









Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-19 Thread Faaland, Olaf P.
I haven't tested 2.10 yet, but I may get a chance to today.  I created a ticket:

https://jira.hpdd.intel.com/browse/LU-10930

thanks,

Olaf P. Faaland
Livermore Computing


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org