Hi John

We are trying to route InfiniBand traffic. The IP traffic is routed separately.
The two clusters we are trying to connect are configured differently: one uses IP over IB, the other uses dedicated Ethernet adapters.

Jan Erik



On 02/27/2018 10:17 AM, John Hearns wrote:
Jan Erik,
    Can you clarify whether you are routing IP traffic between the two 
InfiniBand networks, or are you routing InfiniBand traffic?


In case I can be of help: I manage an InfiniBand network which connects to 
other IP networks using Mellanox VPI gateways, which proxy ARP between IB and 
Ethernet. However, I am not running GPFS traffic over these.



-----Original Message-----
From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Sundermann, Jan 
Erik (SCC)
Sent: Monday, February 26, 2018 5:39 PM
To: gpfsug-discuss@spectrumscale.org
Subject: [gpfsug-discuss] Problems with remote mount via routed IB


Dear all

we are currently trying to remote mount a file system in a routed InfiniBand 
test setup and are facing problems with dropped RDMA connections. The setup is 
as follows:

- Spectrum Scale Cluster 1 is set up on four servers which are connected to the 
same InfiniBand network. Additionally, they are connected to a fast Ethernet 
network providing IP communication in the network 192.168.11.0/24.

- Spectrum Scale Cluster 2 is set up on four additional servers which are 
connected to a second InfiniBand network. These servers have IPs on their IB 
interfaces in the network 192.168.12.0/24.

- IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated 
machine.

- We have a dedicated IB hardware router connected to both IB subnets.
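
The addressing described in the list above can be sketched as a small lookup 
that maps a peer address from the logs to the cluster it belongs to. This is 
only an illustration using Python's standard ipaddress module; the function 
name cluster_of and the labels it returns are assumptions made for this 
example and are not part of Spectrum Scale.

```python
import ipaddress

# Subnets as stated in the setup description.
CLUSTER1_NET = ipaddress.ip_network("192.168.11.0/24")  # Ethernet, cluster 1
CLUSTER2_NET = ipaddress.ip_network("192.168.12.0/24")  # IP over IB, cluster 2


def cluster_of(addr):
    """Return which cluster's subnet an address belongs to (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip in CLUSTER1_NET:
        return "cluster 1"
    if ip in CLUSTER2_NET:
        return "cluster 2"
    return "unknown"


if __name__ == "__main__":
    for peer in ("192.168.11.4", "192.168.12.5"):
        print(peer, "->", cluster_of(peer))
```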


We tested that the routing, both IP and IB, is working between the two clusters 
without problems and that RDMA is working fine for internal communication 
inside both cluster 1 and cluster 2.

When trying to remote mount a file system from cluster 1 in cluster 2, RDMA 
communication is not working as expected. Instead, we see error messages on the 
remote host (cluster 2):


2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1
2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1
2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1
2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 0
2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0
2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0
2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 2
2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3


and in the cluster with the file system (cluster 1):

2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
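
To see at a glance which peers keep dropping, log excerpts like the ones above 
can be tallied per peer address. The following is a rough sketch, not a 
Spectrum Scale tool: the regular expression is inferred only from the message 
shapes shown in this mail, the names EVENT_RE and summarize_errors are made up 
for the example, and it assumes each log entry sits on a single line (entries 
wrapped by a mail client would have to be rejoined first).

```python
import re
from collections import Counter

# Four entries copied from the cluster 2 log above, one entry per line.
SAMPLE_LOG = """\
2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1
"""

# Assumed line shape: "<timestamp>: [<level>] VERBS RDMA <event> to <peer-ip> ..."
EVENT_RE = re.compile(
    r"^\S+: \[(?P<level>[A-Z])\] VERBS RDMA (?P<event>.+?) to "
    r"(?P<peer>\d+(?:\.\d+){3})"
)


def summarize_errors(log_text):
    """Count [E]-level VERBS RDMA entries per peer IP address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match and match.group("level") == "E":
            counts[match.group("peer")] += 1
    return counts


if __name__ == "__main__":
    for peer, n in sorted(summarize_errors(SAMPLE_LOG).items()):
        print(peer, n)
```

Running the same function over the cluster 1 log would count both the 
"rdma read error" and the "closed connection" entries, since both are logged 
at level [E].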



Any advice on how to configure the setup in a way that would allow the remote 
mount via routed IB would be much appreciated.


Thank you and best regards
Jan Erik


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


--

Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)

Jan Erik Sundermann

Hermann-von-Helmholtz-Platz 1, Building 449, Room 226
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 26191
Email: jan.sunderm...@kit.edu
www.scc.kit.edu

KIT – The Research University in the Helmholtz Association

Since 2010, KIT has been certified as a family-friendly university.

