On Mon, May 1, 2017 at 3:45 PM, Strikwerda, Ger <[email protected]> wrote:
> Hi Eli,
>
> We have a 180+ node compute cluster, connected via IB/10Gb, with Lustre
> storage that is also connected via IB/10Gb. We have multiple IB switches,
> with the master/core/big switch manageable via web management. This switch
> is a Mellanox SX6036 FDR switch. One subnet manager is supposed to be
> running on this switch. Using 'sminfo' on the clients, we got info that the
> subnet manager was alive. But when we looked via the web management, the
> subnet manager was unstable. The reason why is unknown. It could be faulty
> firmware. During the weekend the system was running fine.
>

Did anything specific make you look at the switch, or did you check there
only after everything else had been ruled out?

>
> On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg <[email protected]>
> wrote:
>
>> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> Our clients-failed-to-mount/lctl ping horror turned out to be a failing
>>> subnet manager issue. We did not see an issue running 'sminfo', but on the
>>> IB switch we could see that the subnet manager was unstable. This caused
>>> mayhem on the IB/Lustre setup.
>>>
>> Can you describe a bit more of how you found this?
>> Are you running an SM on the switches?
>> That way, if someone else runs into this, they will be able to check it
>> too.
>>
>>> Thanks everybody for their help/advice/hints. Good to see how this
>>> active community works!
>>>
>> Indeed.
>> Eli
>>
>>> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <[email protected]>
>>> wrote:
>>>
>>>> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <[email protected]>
>>>> wrote:
>>>>
>>>>> That specific message happens when the “magic” u32 field at the start
>>>>> of a message does not match what we are expecting.
>>>>> We do check whether the message was transmitted with a different
>>>>> endianness from ours, so when you see this error, we assume that the
>>>>> message has been corrupted or the sender is using an invalid magic
>>>>> value. I don’t believe this value has changed in the history of the
>>>>> LND, so this is more likely corruption of some sort.
>>>>>
>>>>
>>>> OT: this information should probably be added to LU-2977, which
>>>> specifically includes the question: What does "consumer defined fatal
>>>> error" mean and why is this connection rejected?
>>>>
>>>>> Doug
>>>>>
>>>>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <[email protected]> wrote:
>>>>> >
>>>>> > I'm not an LNet expert, but I think the critical issue to focus on is:
>>>>> >
>>>>> > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>>>> > LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>>>>> >
>>>>> > This means that the LND didn't connect at startup time, but I don't
>>>>> > know what the cause is.
>>>>> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
>>>>> > but I don't know enough about IB to tell you what that means. Some of
>>>>> > the later code is checking for mismatched Lustre versions, but it
>>>>> > doesn't even get that far.
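
For readers following along, the magic check Doug describes in the quoted
text above can be sketched roughly as below. This is only an illustration in
Python, not the LND code: EXAMPLE_MAGIC is a made-up placeholder value, and
swab32() and classify_magic() are hypothetical helper names. The idea is that
the receiver compares the first u32 against the expected magic, then against
the byte-swapped magic (a peer of the opposite endianness), and treats
anything else as corruption or an invalid sender.

```python
import struct

# Hypothetical placeholder; NOT the real magic value used by the LND.
EXAMPLE_MAGIC = 0x0A0B0C0D


def swab32(value):
    """Return the 32-bit value with its four bytes in reverse order."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def classify_magic(first_word):
    """Classify the first u32 of a received message.

    Returns "same-endian" if it matches the expected magic,
    "opposite-endian" if it matches the byte-swapped magic,
    and "corrupt-or-invalid" otherwise.
    """
    if first_word == EXAMPLE_MAGIC:
        return "same-endian"
    if first_word == swab32(EXAMPLE_MAGIC):
        return "opposite-endian"
    return "corrupt-or-invalid"


if __name__ == "__main__":
    for word in (EXAMPLE_MAGIC, swab32(EXAMPLE_MAGIC), 0):
        print(hex(word), classify_magic(word))
```

Only the third outcome corresponds to the rejection discussed here; a
byte-swapped magic is a legitimate peer and is handled, not rejected.
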
>>>>> >
>>>>> > Cheers, Andreas
>>>>> >
>>>>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger <[email protected]> wrote:
>>>>> >>
>>>>> >> Hi Raj,
>>>>> >>
>>>>> >> [root@pg-gpu01 ~]# lustre_rmmod
>>>>> >>
>>>>> >> [root@pg-gpu01 ~]# modprobe -v lustre
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>>>>> >>
>>>>> >> dmesg:
>>>>> >>
>>>>> >> LNet: HW CPU cores: 24, npartitions: 4
>>>>> >> alg: No test for crc32 (crc32-table)
>>>>> >> alg: No test for adler32 (adler32-zlib)
>>>>> >> alg: No test for crc32 (crc32-pclmul)
>>>>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>>>> >>
>>>>> >> But no luck,
>>>>> >>
>>>>> >> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>>>>> >> failed to ping 172.23.55.211@o2ib: Input/output error
>>>>> >>
>>>>> >> [root@pg-gpu01 ~]# mount /home
>>>>> >> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01 at /home failed: Input/output error
>>>>> >> Is the MGS running?
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj <[email protected]> wrote:
>>>>> >> Yes, this is strange. Normally, I have seen that a credits mismatch
>>>>> >> results in this scenario, but it doesn't look like this is the case.
>>>>> >>
>>>>> >> You wouldn't want to set the MGS to capture debug messages, as there
>>>>> >> will be a lot of data.
>>>>> >>
>>>>> >> I guess you already tried removing the Lustre drivers and adding them
>>>>> >> again?
>>>>> >> lustre_rmmod
>>>>> >> modprobe -v lustre
>>>>> >>
>>>>> >> And check dmesg for any errors...
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <[email protected]> wrote:
>>>>> >> Hi Raj,
>>>>> >>
>>>>> >> When I do an lctl ping to an MGS server, I do not see any logs at all.
>>>>> >> Also not when I do a successful ping from a working node. Is there a
>>>>> >> way to make the Lustre logging more verbose, to see more detail at the
>>>>> >> LNET level?
>>>>> >>
>>>>> >> It is very strange that a rebooted node is able to lctl ping
>>>>> >> compute nodes, but fails to lctl ping metadata and storage nodes.
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:35 PM, Raj <[email protected]> wrote:
>>>>> >> Ger,
>>>>> >> It looks like the default configuration of Lustre.
>>>>> >>
>>>>> >> Do you see any error message on the MGS side while you are doing
>>>>> >> lctl ping from the rebooted clients?
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger <[email protected]> wrote:
>>>>> >> Hi Eli,
>>>>> >>
>>>>> >> Nothing can be mounted on the Lustre filesystems, so the output is:
>>>>> >>
>>>>> >> [root@pg-gpu01 ~]# lfs df /home/ger/
>>>>> >> [root@pg-gpu01 ~]#
>>>>> >>
>>>>> >> Empty..
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg <[email protected]> wrote:
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <[email protected]> wrote:
>>>>> >> Hello Eli,
>>>>> >>
>>>>> >> Logfile/syslog on the client side:
>>>>> >>
>>>>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>>>> >> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>>>>> >>
>>>>> >> lctl df /path/to/some/file
>>>>> >>
>>>>> >> gives nothing useful? (the second one will dump *a lot*)
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <[email protected]> wrote:
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <[email protected]> wrote:
>>>>> >> Hi Raj (and others),
>>>>> >>
>>>>> >> In which file should I set the credits/peer_credits options?
>>>>> >>
>>>>> >> Perhaps relevant config files:
>>>>> >>
>>>>> >> [root@pg-gpu01 ~]# cd /etc/modprobe.d/
>>>>> >>
>>>>> >> [root@pg-gpu01 modprobe.d]# ls
>>>>> >> anaconda.conf   blacklist-kvm.conf      dist-alsa.conf  dist-oss.conf           ib_ipoib.conf  lustre.conf  openfwwf.conf
>>>>> >> blacklist.conf  blacklist-nouveau.conf  dist.conf       freeipmi-modalias.conf  ib_sdp.conf    mlnx.conf    truescale.conf
>>>>> >>
>>>>> >> [root@pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
>>>>> >> alias netdev-ib* ib_ipoib
>>>>> >>
>>>>> >> [root@pg-gpu01 modprobe.d]# cat ./mlnx.conf
>>>>> >> # Module parameters for MLNX_OFED kernel modules
>>>>> >>
>>>>> >> [root@pg-gpu01 modprobe.d]# cat ./lustre.conf
>>>>> >> options lnet networks=o2ib(ib0)
>>>>> >>
>>>>> >> Are there more Lustre/LNET options that could help in this
>>>>> >> situation?
>>>>> >>
>>>>> >> What about the logfiles?
>>>>> >> Any error messages in syslog? lctl debug options?
>>>>> >> Good luck,
>>>>> >> Eli
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:02 PM, Raj <[email protected]> wrote:
>>>>> >> It may be worth checking your lnet credits and peer_credits in
>>>>> >> /etc/modprobe.d?
>>>>> >> You can compare between working hosts and non-working hosts.
>>>>> >> Thanks
>>>>> >> _Raj
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <[email protected]> wrote:
>>>>> >> Hi Rick,
>>>>> >>
>>>>> >> Even without iptables rules, and loading the correct modules
>>>>> >> afterwards, we get the same results:
>>>>> >>
>>>>> >> [root@pg-gpu01 sysconfig]# iptables --list
>>>>> >> Chain INPUT (policy ACCEPT)
>>>>> >> target     prot opt source               destination
>>>>> >>
>>>>> >> Chain FORWARD (policy ACCEPT)
>>>>> >> target     prot opt source               destination
>>>>> >>
>>>>> >> Chain OUTPUT (policy ACCEPT)
>>>>> >> target     prot opt source               destination
>>>>> >>
>>>>> >> Chain LOGDROP (0 references)
>>>>> >> target     prot opt source               destination
>>>>> >> LOG        all  --  anywhere             anywhere            LOG level warning
>>>>> >> DROP       all  --  anywhere             anywhere
>>>>> >>
>>>>> >> [root@pg-gpu01 sysconfig]# modprobe lnet
>>>>> >>
>>>>> >> [root@pg-gpu01 sysconfig]# modprobe lustre
>>>>> >>
>>>>> >> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>>>>> >>
>>>>> >> failed to ping 172.23.55.211@o2ib: Input/output error
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <[email protected]> wrote:
>>>>> >> This might be a long shot, but have you checked for possible
>>>>> >> firewall rules that might be causing the issue? I’m wondering if
>>>>> >> there is a chance that some rules were added after the nodes were up
>>>>> >> to allow Lustre access, and when a node got rebooted, it lost the
>>>>> >> rules.
>>>>> >>
>>>>> >> --
>>>>> >> Rick Mohr
>>>>> >> Senior HPC System Administrator
>>>>> >> National Institute for Computational Sciences
>>>>> >> http://www.nics.tennessee.edu
>>>>> >>
>>>>> >>> On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <[email protected]> wrote:
>>>>> >>>
>>>>> >>> Hi Russell,
>>>>> >>>
>>>>> >>> Thanks for the IB subnet clues:
>>>>> >>>
>>>>> >>> [root@pg-gpu01 ~]# ibv_devinfo
>>>>> >>> hca_id: mlx4_0
>>>>> >>>         transport:          InfiniBand (0)
>>>>> >>>         fw_ver:             2.32.5100
>>>>> >>>         node_guid:          f452:1403:00f5:4620
>>>>> >>>         sys_image_guid:     f452:1403:00f5:4623
>>>>> >>>         vendor_id:          0x02c9
>>>>> >>>         vendor_part_id:     4099
>>>>> >>>         hw_ver:             0x1
>>>>> >>>         board_id:           MT_1100120019
>>>>> >>>         phys_port_cnt:      1
>>>>> >>>                 port:   1
>>>>> >>>                         state:          PORT_ACTIVE (4)
>>>>> >>>                         max_mtu:        4096 (5)
>>>>> >>>                         active_mtu:     4096 (5)
>>>>> >>>                         sm_lid:         1
>>>>> >>>                         port_lid:       185
>>>>> >>>                         port_lmc:       0x00
>>>>> >>>                         link_layer:     InfiniBand
>>>>> >>>
>>>>> >>> [root@pg-gpu01 ~]# sminfo
>>>>> >>> sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
>>>>> >>>
>>>>> >>> Looks like the rebooted node is able to connect/contact IB/IB
>>>>> >>> subnetmanager
>>>>> >>>
>>>>> >>> On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <[email protected]> wrote:
>>>>> >>> At first glance, this sounds like your Infiniband subnet manager may
>>>>> >>> be down or malfunctioning. In this case, nodes which were already up
>>>>> >>> when the subnet manager was working will continue to be able to
>>>>> >>> communicate over IB, but nodes which reboot after the SM goes down
>>>>> >>> will not.
>>>>> >>>
>>>>> >>> You can test this theory by running the 'ibv_devinfo' command on one
>>>>> >>> of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
>>>>> >>> this confirms there is a problem with your subnet manager.
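
The PORT_INIT test Rusty describes can be scripted across many nodes. Below
is a minimal sketch in Python that parses text in the shape of the
ibv_devinfo output quoted above and lists ports stuck in PORT_INIT. The
function names port_states() and ports_waiting_for_sm() are hypothetical, and
the parsing assumes the "port:" / "state:" layout shown in this thread. Note
that, as this thread's resolution shows, PORT_ACTIVE does not rule out an
unstable subnet manager.

```python
import re


def port_states(devinfo_text):
    """Return {port_number: state_name} parsed from ibv_devinfo-style text."""
    states = {}
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        # A line such as "port:   1" starts a new per-port section.
        m = re.match(r"port:\s+(\d+)$", line)
        if m:
            port = int(m.group(1))
            continue
        # A line such as "state:  PORT_ACTIVE (4)" belongs to the current port.
        m = re.match(r"state:\s+(PORT_\w+)", line)
        if m and port is not None:
            states[port] = m.group(1)
    return states


def ports_waiting_for_sm(devinfo_text):
    """Return the sorted port numbers that are in state PORT_INIT."""
    return [p for p, s in sorted(port_states(devinfo_text).items())
            if s == "PORT_INIT"]


# Abbreviated sample taken from the output quoted in this thread.
SAMPLE = """\
hca_id: mlx4_0
        phys_port_cnt:  1
                port:   1
                        state:      PORT_ACTIVE (4)
                        sm_lid:     1
                        port_lid:   185
"""

if __name__ == "__main__":
    print(port_states(SAMPLE))
    print(ports_waiting_for_sm(SAMPLE))
```

In practice the text would come from running ibv_devinfo on each rebooted
node; how it is collected (ssh, pdsh, a job script) is left open here.
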
>>>>> >>>
>>>>> >>> Sincerely,
>>>>> >>> Rusty Dekema
>>>>> >>>
>>>>> >>> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
>>>>> >>> <[email protected]> wrote:
>>>>> >>>> Hi everybody,
>>>>> >>>>
>>>>> >>>> Here at the University of Groningen we are now experiencing a strange
>>>>> >>>> Lustre error. If a client reboots, it fails to mount the Lustre
>>>>> >>>> storage. The client is not able to reach the MGS service. The storage
>>>>> >>>> and nodes communicate over IB, and until now this worked without any
>>>>> >>>> problems. It looks like an issue inside LNET. Clients cannot LNET
>>>>> >>>> ping/connect to the metadata and/or storage nodes, but the clients
>>>>> >>>> are able to LNET ping each other. Clients which have not been
>>>>> >>>> rebooted are working fine and have their mounts on our Lustre
>>>>> >>>> filesystem.
>>>>> >>>>
>>>>> >>>> Lustre client log:
>>>>> >>>>
>>>>> >>>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >>>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>>>>> >>>>
>>>>> >>>> LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>>>>> >>>> LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
>>>>> >>>> Lustre: Unmounted pgdata01-client
>>>>> >>>> LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
>>>>> >>>> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req@ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@[email protected]@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>>>>> >>>> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
>>>>> >>>> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211@[email protected]@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
>>>>> >>>> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
>>>>> >>>> LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>>>>> >>>> LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
>>>>> >>>>
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
>>>>> >>>> LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211@o2ib failed: 5
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
>>>>> >>>> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
>>>>> >>>> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
>>>>> >>>> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
>>>>> >>>>
>>>>> >>>> LNET ping of a metadata-node:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>>>>> >>>> failed to ping 172.23.55.211@o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> LNET ping of the number 2 metadata-node:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
>>>>> >>>> failed to ping 172.23.55.212@o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> LNET ping of a random compute-node:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
>>>>> >>>> 12345-0@lo
>>>>> >>>> 12345-172.23.52.5@o2ib
>>>>> >>>>
>>>>> >>>> LNET to OST01:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
>>>>> >>>> failed to ping 172.23.55.201@o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> LNET to OST02:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
>>>>> >>>> failed to ping 172.23.55.202@o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> 'normal' pings (on ip level) works fine:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# ping 172.23.55.201
>>>>> >>>> PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
>>>>> >>>> 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# ping 172.23.55.202
>>>>> >>>> PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
>>>>> >>>> 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
>>>>> >>>>
>>>>> >>>> lctl on a rebooted node:
>>>>> >>>>
>>>>> >>>> [root@pg-gpu01 ~]# lctl dl
>>>>> >>>>
>>>>> >>>> lctl on a not rebooted node:
>>>>> >>>>
>>>>> >>>> [root@pg-node005 ~]# lctl dl
>>>>> >>>> 0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
>>>>> >>>> 1 UP lov pgtemp01-clilov-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
>>>>> >>>> 2 UP lmv pgtemp01-clilmv-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
>>>>> >>>> 3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 4 UP osc pgtemp01-OST0001-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 5 UP osc pgtemp01-OST0003-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 6 UP osc pgtemp01-OST0005-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 7 UP osc pgtemp01-OST0007-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 8 UP osc pgtemp01-OST0009-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 9 UP osc pgtemp01-OST000b-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 10 UP osc pgtemp01-OST000d-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 11 UP osc pgtemp01-OST000f-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 12 UP osc pgtemp01-OST0011-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 13 UP osc pgtemp01-OST0002-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 14 UP osc pgtemp01-OST0004-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 15 UP osc pgtemp01-OST0006-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 16 UP osc pgtemp01-OST0008-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 17 UP osc pgtemp01-OST000a-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 18 UP osc pgtemp01-OST000c-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 19 UP osc pgtemp01-OST000e-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 20 UP osc pgtemp01-OST0010-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 21 UP osc pgtemp01-OST0012-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 22 UP osc pgtemp01-OST0013-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 23 UP osc pgtemp01-OST0015-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 24 UP osc pgtemp01-OST0017-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 25 UP osc pgtemp01-OST0014-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 26 UP osc pgtemp01-OST0016-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 27 UP osc pgtemp01-OST0018-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 28 UP lov pgdata01-clilov-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
>>>>> >>>> 29 UP lmv pgdata01-clilmv-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
>>>>> >>>> 30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 31 UP osc pgdata01-OST0001-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 32 UP osc pgdata01-OST0003-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 33 UP osc pgdata01-OST0005-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 34 UP osc pgdata01-OST0007-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 35 UP osc pgdata01-OST0009-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 36 UP osc pgdata01-OST000b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 37 UP osc pgdata01-OST000d-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 38 UP osc pgdata01-OST000f-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 39 UP osc pgdata01-OST0002-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 40 UP osc pgdata01-OST0004-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 41 UP osc pgdata01-OST0006-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 42 UP osc pgdata01-OST0008-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 43 UP osc pgdata01-OST000a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 44 UP osc pgdata01-OST000c-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 45 UP osc pgdata01-OST000e-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 46 UP osc pgdata01-OST0010-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 47 UP osc pgdata01-OST0013-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 48 UP osc pgdata01-OST0015-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 49 UP osc pgdata01-OST0017-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 50 UP osc pgdata01-OST0014-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 51 UP osc pgdata01-OST0016-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 52 UP osc pgdata01-OST0018-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 53 UP osc pgdata01-OST0019-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 54 UP osc pgdata01-OST001a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 55 UP osc pgdata01-OST001b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 56 UP lov pghome01-clilov-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>>>>> >>>> 57 UP lmv pghome01-clilmv-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>>>>> >>>> 58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>>>>> >>>> 59 UP osc pghome01-OST0011-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>>>>> >>>> 60 UP osc pghome01-OST0012-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>>>>> >>>>
>>>>> >>>> Please help, any clues/advice/hints/tips are appreciated
>>>>> >>>>
>>>>> >>>> --
>>>>> >>>>
>>>>> >>>> Kind regards,
>>>>> >>>>
>>>>> >>>> Ger Strikwerda
>>>>> >>>> Chef Special
>>>>> >>>> Rijksuniversiteit Groningen
>>>>> >>>> Centrum voor Informatie Technologie
>>>>> >>>> Unit Pragmatisch Systeembeheer
>>>>> >>>>
>>>>> >>>> Smitsborg
>>>>> >>>> Nettelbosje 1
>>>>> >>>> 9747 AJ Groningen
>>>>> >>>> Tel. 050 363 9276
>>>>> >>>>
>>>>> >>>> "God is hard, God is fair
>>>>> >>>> some men he gave brains, others he gave hair"
>>>>> >>>>
>>>>> >>>> _______________________________________________
>>>>> >>>> lustre-discuss mailing list
>>>>> >>>> [email protected]
>>>>> >>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>> >
>>>>> > Cheers, Andreas
>>>>> > --
>>>>> > Andreas Dilger
>>>>> > Lustre Principal Architect
>>>>> > Intel Corporation
>
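
For anyone who wants to keep an eye on the SM from a client, here is a rough
sketch in Python that compares successive 'sminfo' output lines (in the format
quoted in this thread) and flags a changed master, a state other than
SMINFO_MASTER, or an activity count that stopped advancing. The names
SmSample, parse_sminfo() and suspicious_changes() are hypothetical, and the
choice of these three signals is an assumption. Whether this would have caught
the instability reported here is unknown, since 'sminfo' looked fine from the
clients while the switch's web management showed the SM as unstable.

```python
import re
from typing import List, NamedTuple


class SmSample(NamedTuple):
    lid: int
    guid: str
    activity: int
    priority: int
    state: str


# Matches a line such as:
# sminfo: sm lid 1 sm guid 0x..., activity count N priority P state S NAME
_SMINFO_RE = re.compile(
    r"sm lid\s+(\d+)\s+sm guid\s+(0x[0-9a-fA-F]+),\s+activity count\s+(\d+)"
    r"\s+priority\s+(\d+)\s+state\s+\d+\s+(\w+)"
)


def parse_sminfo(line: str) -> SmSample:
    """Parse one line of sminfo output into an SmSample."""
    m = _SMINFO_RE.search(line)
    if m is None:
        raise ValueError("unrecognised sminfo output: %r" % line)
    return SmSample(int(m.group(1)), m.group(2).lower(), int(m.group(3)),
                    int(m.group(4)), m.group(5))


def suspicious_changes(samples: List[SmSample]) -> List[str]:
    """Compare consecutive samples and describe anything that looks odd."""
    findings = []
    for prev, cur in zip(samples, samples[1:]):
        if cur.guid != prev.guid or cur.lid != prev.lid:
            findings.append("master SM changed: %s -> %s" % (prev.guid, cur.guid))
        elif cur.activity <= prev.activity:
            findings.append("activity count did not advance (%d)" % cur.activity)
        if cur.state != "SMINFO_MASTER":
            findings.append("unexpected SM state: %s" % cur.state)
    return findings


if __name__ == "__main__":
    line = ("sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count "
            "80878098 priority 0 state 3 SMINFO_MASTER")
    first = parse_sminfo(line)
    later = first._replace(activity=first.activity + 10)
    print(suspicious_changes([first, later]))
```

The samples would come from running 'sminfo' at intervals (cron, a monitoring
agent); that part is deliberately left out of the sketch.
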
_______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
