down is the state when the node is reachable, has a valid config, and mmfsd is not started
Not sure what steps you did / where you are now... but once the node has a valid config (by removing/re-adding, or by restoring) and the RPMs installed (don't forget to build the portability layer),
you're good to go...
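A rough sketch of that recovery sequence, assuming cl004 is the node that lost its config (from this thread) and cl001 is a hypothetical healthy node holding a valid /var/mmfs/gen; it is a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch: bring a reinstalled node back into the cluster.
# run() only records/prints the commands (dry run); execute them for real use.
CMDS=""
run() { CMDS="$CMDS $*"; echo "would run: $*"; }

NODE=cl004        # node that lost its config (name taken from this thread)
GOODNODE=cl001    # hypothetical healthy node with a valid /var/mmfs/gen

# 1. Pull the cluster configuration back into /var/mmfs/gen on the bad node
run mmsdrrestore -p "$GOODNODE"

# 2. Rebuild the GPFS portability layer against the running kernel
run mmbuildgpl

# 3. Start the daemon; the node should go from down to active
run mmstartup -N "$NODE"
```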
From: "J. Eric Wonderley" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 02/03/2017 02:47 PM
Subject: Re: [gpfsug-discuss] proper gpfs shutdown when node disappears
Sent by: [email protected]
Well, we got it into the down state by using mmsdrrestore -p to recover the configuration into /var/mmfs/gen on cl004.
Anyhow, we ended up with unknown for cl004 when it powered off. Short of removing the node, unknown is the state you get.
Unknown seems stable for a hopefully short outage of cl004.
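For reference, the states being discussed (down, unknown, active) can be read with mmgetstate; a dry-run sketch, using the node name from this thread:

```shell
#!/bin/sh
# Sketch: check a node's GPFS daemon state (active / down / unknown ...).
CMDS=""
run() { CMDS="$CMDS $*"; echo "would run: $*"; }

run mmgetstate -N cl004   # single node, name taken from the thread
run mmgetstate -a         # or report every node in the cluster
```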
Thanks
On Thu, Feb 2, 2017 at 4:28 PM, Olaf Weiser <[email protected]> wrote:
Many ways lead to Rome... and I agree, mmexpelnode is a nice command.
Another approach:
power it off (not reachable by ping) ... mmdelnode ... power on/boot ... mmaddnode ...
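Sketched out, that delete/re-add approach might look like the following; the node name and BMC address/credentials are hypothetical, and it is a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch: drop a dead node from the cluster, then re-add it after it comes back.
CMDS=""
run() { CMDS="$CMDS $*"; echo "would run: $*"; }

NODE=cl004        # hypothetical node name
BMC=cl004-bmc     # hypothetical BMC/IPMI address

run ipmitool -H "$BMC" -U admin -P PASSWORD chassis power off   # make sure it is unreachable
run mmdelnode -N "$NODE"                                        # drop it from the cluster
run ipmitool -H "$BMC" -U admin -P PASSWORD chassis power on
run mmaddnode -N "$NODE"                                        # re-add once it has booted
```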
From: Aaron Knister <[email protected]>
To: <[email protected]>
Date: 02/02/2017 08:37 PM
Subject: Re: [gpfsug-discuss] proper gpfs shutdown when node disappears
Sent by: [email protected]
You could forcibly expel the node (one of my favorite GPFS commands):
mmexpelnode -N $nodename
and then power it off after the expulsion is complete and then do
mmexpelnode -r -N $nodename
which will allow it to join the cluster the next time you try to start up
GPFS on it. You'll still likely have to go through recovery, but you'll
skip the part where GPFS wonders where the node went prior to
expelling it.
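Put together as one sequence (dry run; $nodename as in the commands above, BMC address and credentials hypothetical):

```shell
#!/bin/sh
# Sketch: expel a node, power it off, then clear the expulsion so it may rejoin.
CMDS=""
run() { CMDS="$CMDS $*"; echo "would run: $*"; }

nodename=cl004    # hypothetical; substitute the affected node

run mmexpelnode -N "$nodename"                    # force the cluster to expel it
run ipmitool -H "$nodename-bmc" -U admin -P PASSWORD chassis power off
run mmexpelnode -r -N "$nodename"                 # un-expel so it can rejoin on next startup
```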
-Aaron
On 2/2/17 2:28 PM, [email protected] wrote:
> On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:
>
>> but the /var/mmfs DIR is obviously damaged/empty... whatever... that's why you
>> see a message like this..
>> have you reinstalled that node / done any backup/restore thing?
>
> The internal RAID controller died a horrid death and basically took
> all the OS partitions with it. So the node was just sort of limping along,
> where the mmfsd process was still coping because it wasn't doing any
> I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work
> because that requires accessing stuff in /var.
>
> At that point, it starts getting tempting to just use ipmitool from
> another node to power the comatose one down - but that often causes
> a cascade of other issues while things are stuck waiting for timeouts.
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
