Hi Rick,

Even without iptables rules, and after loading the correct modules, we get the same result:

[root@pg-gpu01 sysconfig]# iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain LOGDROP (0 references)
target     prot opt source               destination
LOG        all  --  anywhere             anywhere            LOG level warning
DROP       all  --  anywhere             anywhere

[root@pg-gpu01 sysconfig]# modprobe lnet
[root@pg-gpu01 sysconfig]# modprobe lustre
[root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
failed to ping 172.23.55.211@o2ib: Input/output error

On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <[email protected]> wrote:

> This might be a long shot, but have you checked for possible firewall
> rules that might be causing the issue? I’m wondering if there is a chance
> that some rules were added after the nodes were up to allow Lustre access,
> and when a node got rebooted, it lost the rules.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
> On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <[email protected]> wrote:
> >
> > Hi Russell,
> >
> > Thanks for the IB subnet clues:
> >
> > [root@pg-gpu01 ~]# ibv_devinfo
> > hca_id: mlx4_0
> >     transport:          InfiniBand (0)
> >     fw_ver:             2.32.5100
> >     node_guid:          f452:1403:00f5:4620
> >     sys_image_guid:     f452:1403:00f5:4623
> >     vendor_id:          0x02c9
> >     vendor_part_id:     4099
> >     hw_ver:             0x1
> >     board_id:           MT_1100120019
> >     phys_port_cnt:      1
> >         port:   1
> >             state:          PORT_ACTIVE (4)
> >             max_mtu:        4096 (5)
> >             active_mtu:     4096 (5)
> >             sm_lid:         1
> >             port_lid:       185
> >             port_lmc:       0x00
> >             link_layer:     InfiniBand
> >
> > [root@pg-gpu01 ~]# sminfo
> > sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
> >
> > Looks like the rebooted node is able to connect to IB and contact the
> > IB subnet manager.
> >
> > On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <[email protected]> wrote:
> > At first glance, this sounds like your InfiniBand subnet manager may
> > be down or malfunctioning. In this case, nodes which were already up
> > when the subnet manager was working will continue to be able to
> > communicate over IB, but nodes which reboot after the SM goes down
> > will not.
> >
> > You can test this theory by running the 'ibv_devinfo' command on one
> > of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
> > this confirms there is a problem with your subnet manager.
> >
> > Sincerely,
> > Rusty Dekema
> >
> > On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
> > <[email protected]> wrote:
> > > Hi everybody,
> > >
> > > Here at the University of Groningen we are now experiencing a strange
> > > Lustre error. If a client reboots, it fails to mount the Lustre storage.
> > > The client is not able to reach the MGS service. The storage and nodes
> > > communicate over IB, until now without any problems. It looks like an
> > > issue inside LNET. Clients cannot LNET ping/connect to the metadata
> > > and/or storage servers, but the clients are able to LNET ping each other.
> > > Clients which have not been rebooted are working fine and still have
> > > their mounts on our Lustre filesystem.
> > >
> > > Lustre client log:
> > >
> > > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> > > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> > >
> > > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> > > Lustre: Unmounted pgdata01-client
> > > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req@ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@[email protected]@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> > > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> > > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211@[email protected]@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> > > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> > > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> > >
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211@o2ib failed: 5
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> > >
> > > LNET ping of a metadata node:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> > > failed to ping 172.23.55.211@o2ib: Input/output error
> > >
> > > LNET ping of the second metadata node:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
> > > failed to ping 172.23.55.212@o2ib: Input/output error
> > >
> > > LNET ping of a random compute node:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
> > > 12345-0@lo
> > > 12345-172.23.52.5@o2ib
> > >
> > > LNET to OST01:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
> > > failed to ping 172.23.55.201@o2ib: Input/output error
> > >
> > > LNET to OST02:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
> > > failed to ping 172.23.55.202@o2ib: Input/output error
> > >
> > > 'Normal' pings (at the IP level) work fine:
> > >
> > > [root@pg-gpu01 ~]# ping 172.23.55.201
> > > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> > > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
> > >
> > > [root@pg-gpu01 ~]# ping 172.23.55.202
> > > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> > > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
> > >
> > > lctl on a rebooted node:
> > >
> > > [root@pg-gpu01 ~]# lctl dl
> > >
> > > lctl on a node that has not been rebooted:
> > >
> > > [root@pg-node005 ~]# lctl dl
> > >   0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
> > >   1 UP lov pgtemp01-clilov-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> > >   2 UP lmv pgtemp01-clilmv-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> > >   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  28 UP lov pgdata01-clilov-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> > >  29 UP lmv pgdata01-clilmv-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> > >  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > >  56 UP lov pghome01-clilov-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> > >  57 UP lmv pghome01-clilmv-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> > >  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > >  59 UP osc pghome01-OST0011-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > >  60 UP osc pghome01-OST0012-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > >
> > > Please help, any clues/advice/hints/tips are appreciated.
> > >
> > > --
> > >
> > > Kind regards,
> > >
> > > Ger Strikwerda
> > > Chef Special
> > > Rijksuniversiteit Groningen
> > > Centrum voor Informatie Technologie
> > > Unit Pragmatisch Systeembeheer
> > >
> > > Smitsborg
> > > Nettelbosje 1
> > > 9747 AJ Groningen
> > > Tel. 050 363 9276
> > >
> > > "God is hard, God is fair
> > > some men he gave brains, others he gave hair"
> > >
> > > _______________________________________________
> > > lustre-discuss mailing list
> > > [email protected]
> > > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
> > --
> > Kind regards,
> >
> > Ger Strikwerda
> >
> > Chef Special
> > Rijksuniversiteit Groningen
> > Centrum voor Informatie Technologie
> > Unit Pragmatisch Systeembeheer
> >
> > Smitsborg
> > Nettelbosje 1
> > 9747 AJ Groningen
> > Tel. 050 363 9276
> >
> > "God is hard, God is fair
> > some men he gave brains, others he gave hair"
> > _______________________________________________
> > lustre-discuss mailing list
> > [email protected]
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>

--
Kind regards,

Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Unit Pragmatisch Systeembeheer

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276

"God is hard, God is fair
some men he gave brains, others he gave hair"
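For anyone who wants to script the port-state check Rusty describes in the quoted thread, here is a minimal Python sketch. It only parses text in the format that `ibv_devinfo` printed in this thread; the function names `port_states` and `subnet_manager_suspect` are made up for illustration and are not part of any Lustre or OFED tool. Note that a port in PORT_ACTIVE only rules out the subnet manager theory. As the thread shows, LNet pings to the servers can still fail with the port active.

```python
import re


def port_states(devinfo_text):
    """Map (hca_id, port number) to the state name printed by ibv_devinfo.

    Assumes the layout shown in this thread: an "hca_id:" line, later a
    "port:" line, followed by a "state:" line for that port.
    """
    states = {}
    current_hca = None
    current_port = None
    for line in devinfo_text.splitlines():
        hca_match = re.match(r"\s*hca_id:\s*(\S+)", line)
        if hca_match:
            current_hca = hca_match.group(1)
            current_port = None
            continue
        # Anchored at line start, so "phys_port_cnt:" and "port_lid:" do not match.
        port_match = re.match(r"\s*port:\s*(\d+)", line)
        if port_match:
            current_port = int(port_match.group(1))
            continue
        state_match = re.match(r"\s*state:\s*(\w+)", line)
        if state_match and current_port is not None:
            states[(current_hca, current_port)] = state_match.group(1)
    return states


def subnet_manager_suspect(devinfo_text):
    """Return True if any port is stuck in PORT_INIT (Rusty's test)."""
    return any(state == "PORT_INIT" for state in port_states(devinfo_text).values())
```

To use it on a live node, one option is to pass in the output of `subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout`. For the output posted from pg-gpu01, `subnet_manager_suspect` returns False, which matches the conclusion in the thread that the subnet manager is reachable.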
