At first glance, this sounds like your InfiniBand subnet manager (SM) may be down or malfunctioning. In that case, nodes that were already up while the subnet manager was working can still communicate over IB, but nodes that reboot after the SM goes down cannot.
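To make that symptom easier to spot on a rebooted node, here is a rough sketch of a port-state check. The helper name `ports_in_init` is made up for illustration; it reads `ibv_devinfo`-style text on standard input and prints each port whose state is PORT_INIT. It assumes the usual `hca_id:`, `port:` and `state:` lines in that output, so adjust the patterns if your version prints them differently.

```shell
#!/bin/sh
# Hypothetical helper: list ports that are stuck in PORT_INIT.
# Reads ibv_devinfo-style text on stdin, so it can be tried without IB hardware.
ports_in_init() {
    awk '
        $1 == "hca_id:" { hca = $2 }     # remember the current adapter name
        $1 == "port:"   { port = $2 }    # remember the current port number
        $1 == "state:" && $2 == "PORT_INIT" { print hca " port " port }
    '
}

# Typical use on a rebooted node (needs the libibverbs utilities installed):
#   ibv_devinfo | ports_in_init
```

If this prints nothing and `ibv_devinfo` shows the port as PORT_ACTIVE, the subnet manager is probably not the culprit and the problem lies elsewhere.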
You can test this theory by running the 'ibv_devinfo' command on one of your rebooted nodes. If the relevant IB port is in state PORT_INIT, this confirms there is a problem with your subnet manager.

Sincerely,
Rusty Dekema

On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger <[email protected]> wrote:
> Hi everybody,
>
> Here at the University of Groningen we are now experiencing a strange Lustre
> error. If a client reboots, it fails to mount the Lustre storage. The client
> is not able to reach the MGS. The storage and nodes are communicating over
> IB and until now have done so without any problems. It looks like an issue
> inside LNET. Clients cannot LNET ping/connect to the metadata and/or
> storage servers, but the clients are able to LNET ping each other. Clients
> which have not been rebooted are working fine and have their mounts on our
> Lustre filesystem.
>
> Lustre client log:
>
> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>
> LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> Lustre: Unmounted pgdata01-client
> LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req@ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@[email protected]@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211@[email protected]@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
>
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211@o2ib failed: 5
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
>
> LNET ping of a metadata node:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> failed to ping 172.23.55.211@o2ib: Input/output error
>
> LNET ping of the second metadata node:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
> failed to ping 172.23.55.212@o2ib: Input/output error
>
> LNET ping of a random compute node:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
> 12345-0@lo
> 12345-172.23.52.5@o2ib
>
> LNET to OST01:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
> failed to ping 172.23.55.201@o2ib: Input/output error
>
> LNET to OST02:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
> failed to ping 172.23.55.202@o2ib: Input/output error
>
> 'Normal' pings (at the IP level) work fine:
>
> [root@pg-gpu01 ~]# ping 172.23.55.201
> PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
>
> [root@pg-gpu01 ~]# ping 172.23.55.202
> PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
>
> lctl on a rebooted node:
>
> [root@pg-gpu01 ~]# lctl dl
>
> lctl on a node that has not been rebooted:
>
> [root@pg-node005 ~]# lctl dl
> 0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
> 1 UP lov pgtemp01-clilov-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> 2 UP lmv pgtemp01-clilmv-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> 3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 4 UP osc pgtemp01-OST0001-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 5 UP osc pgtemp01-OST0003-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 6 UP osc pgtemp01-OST0005-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 7 UP osc pgtemp01-OST0007-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 8 UP osc pgtemp01-OST0009-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 9 UP osc pgtemp01-OST000b-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 10 UP osc pgtemp01-OST000d-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 11 UP osc pgtemp01-OST000f-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 12 UP osc pgtemp01-OST0011-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 13 UP osc pgtemp01-OST0002-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 14 UP osc pgtemp01-OST0004-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 15 UP osc pgtemp01-OST0006-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 16 UP osc pgtemp01-OST0008-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 17 UP osc pgtemp01-OST000a-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 18 UP osc pgtemp01-OST000c-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 19 UP osc pgtemp01-OST000e-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 20 UP osc pgtemp01-OST0010-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 21 UP osc pgtemp01-OST0012-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 22 UP osc pgtemp01-OST0013-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 23 UP osc pgtemp01-OST0015-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 24 UP osc pgtemp01-OST0017-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 25 UP osc pgtemp01-OST0014-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 26 UP osc pgtemp01-OST0016-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 27 UP osc pgtemp01-OST0018-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> 28 UP lov pgdata01-clilov-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> 29 UP lmv pgdata01-clilmv-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> 30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 31 UP osc pgdata01-OST0001-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 32 UP osc pgdata01-OST0003-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 33 UP osc pgdata01-OST0005-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 34 UP osc pgdata01-OST0007-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 35 UP osc pgdata01-OST0009-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 36 UP osc pgdata01-OST000b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 37 UP osc pgdata01-OST000d-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 38 UP osc pgdata01-OST000f-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 39 UP osc pgdata01-OST0002-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 40 UP osc pgdata01-OST0004-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 41 UP osc pgdata01-OST0006-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 42 UP osc pgdata01-OST0008-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 43 UP osc pgdata01-OST000a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 44 UP osc pgdata01-OST000c-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 45 UP osc pgdata01-OST000e-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 46 UP osc pgdata01-OST0010-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 47 UP osc pgdata01-OST0013-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 48 UP osc pgdata01-OST0015-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 49 UP osc pgdata01-OST0017-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 50 UP osc pgdata01-OST0014-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 51 UP osc pgdata01-OST0016-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 52 UP osc pgdata01-OST0018-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 53 UP osc pgdata01-OST0019-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 54 UP osc pgdata01-OST001a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 55 UP osc pgdata01-OST001b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> 56 UP lov pghome01-clilov-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> 57 UP lmv pghome01-clilmv-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> 58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> 59 UP osc pghome01-OST0011-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> 60 UP osc pghome01-OST0012-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>
> Please help, any clues/advice/hints/tips are appreciated.
>
> --
>
> Kind regards,
>
> Ger Strikwerda
> Chef Special
> Rijksuniversiteit Groningen
> Centrum voor Informatie Technologie
> Unit Pragmatisch Systeembeheer
>
> Smitsborg
> Nettelbosje 1
> 9747 AJ Groningen
> Tel. 050 363 9276
>
> "God is hard, God is fair
> some men he gave brains, others he gave hair"
>
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
