Just some ideas on what to try. When you attempted mmdelnode, was that node still active, with its IP address known in the cluster? If so, shut it down and try again. Mind the restrictions of mmdelnode, though (it cannot delete NSD servers).
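As a quick first check, you can confirm whether the stale node really is off the network before retrying the deletion. This is only a sketch; the hostname is the one mentioned elsewhere in this thread, so substitute your own, and the mmdelnode in the message is shown, not run:

```shell
# Hostname of the node to be removed (from this thread; substitute yours).
node=dhcp-os-129-164.slac.stanford.edu

# One ping, two-second timeout: is the node still answering?
if ping -c 1 -W 2 "$node" >/dev/null 2>&1; then
    echo "$node still answers; shut it down before retrying mmdelnode"
else
    echo "$node unreachable; retry: mmdelnode -N $node"
fi
```

Note that mmdelnode can still fail against an unreachable node if the CCR layer itself has no quorum, which is the situation described below.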
Try to fake one of the currently missing cluster nodes: restore the old system backup to the reinstalled server, if available, or temporarily install the GPFS software there and copy over the GPFS configuration data (/var/mmfs/) from a node that is still active. Configure the admin and daemon interfaces of the faked node on that machine, then try to start the cluster and see whether it comes up with quorum. If it does, go ahead and cleanly de-configure what's needed to remove that node from the cluster gracefully. Once you reach quorum with the remaining nodes, you are in safe territory.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
High Performance Computing Services / Integrated Technology Services / Data Center Services
-------------------------------------------------------------------------------------------
IBM Deutschland
Rathausstr. 7
09111 Chemnitz
Phone: +49 371 6978 2165
Mobile: +49 175 575 2877
E-Mail: [email protected]
-------------------------------------------------------------------------------------------
IBM Deutschland Business & Technology Services GmbH
Management: Thomas Wolter, Sven Schooß
Registered office: Ehningen / Register court: Amtsgericht Stuttgart, HRB 17122

From: Renata Maria Dart <[email protected]>
To: Simon Thompson <[email protected]>
Cc: gpfsug main discussion list <[email protected]>
Date: 27/06/2018 21:30
Subject: Re: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
Sent by: [email protected]

Hi Simon, yes, I ran mmsdrrestore -p <working node in the cluster>, and that helped to create the /var/mmfs/ccr directory, which was missing. But it didn't create a ccr.nodes file, so I ended up scp'ing that over by hand, which I hope was the right thing to do.
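After hand-copying ccr.nodes, a quick sanity check is to list the node names it defines and compare them against what the cluster should contain. The layout assumed below (comma-separated fields with the node name last) is an illustration only; the real format can differ between Spectrum Scale releases, so check against a copy from a healthy node. The sample file, names, and IPs here are placeholders:

```shell
# Write a made-up sample in the assumed "id,flags,ip,port,name" layout.
# (Illustrative only; verify the real layout on a working node.)
cat > /tmp/ccr.nodes.sample <<'EOF'
1,,10.0.0.1,,ocio-gpu01.slac.stanford.edu
2,,10.0.0.2,,ocio-gpu02.slac.stanford.edu
3,,10.0.0.3,,dhcp-os-129-164.slac.stanford.edu
EOF

# Print the last comma-separated field (the node name) of each entry,
# so you can eyeball the quorum-node list against reality.
awk -F, '{print $NF}' /tmp/ccr.nodes.sample
```

In this thread the decommissioned host would show up in exactly such a listing, which is why it still blocks mmdelnode.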
The one host that is no longer in service is still in that ccr.nodes file, and when I try to mmdelnode it I get:

[root@ocio-gpu03 renata]# mmdelnode -N dhcp-os-129-164.slac.stanford.edu
mmdelnode: Unable to obtain the GPFS configuration file lock.
mmdelnode: GPFS was unable to obtain a lock from node dhcp-os-129-164.slac.stanford.edu.
mmdelnode: Command failed. Examine previous error messages to determine cause.

despite the fact that it doesn't respond to ping. The mmstartup on the newly reinstalled node fails as in my initial email.

I should mention that the two "working" nodes are running 4.2.3.4. The person who reinstalled the node that won't start up put on 4.2.3.8. I didn't think that was the cause of this problem, though, and thought I would try to get the cluster talking again before upgrading the rest of the nodes or downgrading the reinstalled one.

Thanks,

Renata

On Wed, 27 Jun 2018, Simon Thompson wrote:

>Have you tried running mmsdrrestore on the reinstalled node to re-add it to the cluster, and then try to start up gpfs on it?
>
> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1pdg_mmsdrrest.htm
>
>Simon
>________________________________________
>From: [email protected] [[email protected]] on behalf of Renata Maria Dart [[email protected]]
>Sent: 27 June 2018 19:09
>To: [email protected]
>Subject: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
>
>Hi, we have a client cluster of 4 nodes with 3 quorum nodes. One of the
>quorum nodes is no longer in service and the other was reinstalled with
>a newer OS, both without informing the gpfs admins. GPFS is still
>"working" on the two remaining nodes; that is, they continue to have access
>to the gpfs data on the remote clusters. But I can no longer get
>any gpfs commands to work.
>On one of the 2 nodes that are still serving data:
>
>[root@ocio-gpu01 ~]# mmlscluster
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
>mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>
>On the reinstalled node, this fails in the same way:
>
>[root@ocio-gpu02 ccr]# mmstartup
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
>mmstartup: Command failed. Examine previous error messages to determine cause.
>
>
>I have looked through the users group interchanges but didn't find anything
>that seems to fit this scenario.
>
>Is there a way to salvage this cluster? Can it be done without
>shutting gpfs down on the 2 nodes that continue to work?
>
>Thanks for any advice,
>
>Renata Dart
>SLAC National Accelerator Lab
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss
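A note on why both commands fail with err 809: CCR will only serve configuration files (such as mmsdrfs) when a strict majority of the defined quorum nodes is reachable with intact CCR state. With 3 quorum nodes defined, that majority is 3/2 + 1 = 2; here one quorum node is gone and another was reinstalled with its /var/mmfs state wiped, leaving only one usable vote. A minimal sketch of the arithmetic, with the counts from this thread as assumptions:

```shell
# Majority needed for CCR with q defined quorum nodes is q/2 + 1
# (integer division). Counts below are taken from this thread.
q=3           # quorum nodes listed in ccr.nodes
available=1   # only one quorum node still has intact CCR state
needed=$(( q / 2 + 1 ))

echo "needed=$needed available=$available"
if [ "$available" -lt "$needed" ]; then
    echo "no CCR quorum: configuration reads fail (err 809)"
fi
```

This is also why Uwe's suggestion of faking a second quorum node works: it pushes `available` back to the majority threshold, after which the stale node can be removed gracefully.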
