I work for a largish healthcare co. We finally started using (exploiting?) Hadoop this year.
The day before our big, C-Suite-sponsored Hadoop launch celebration, we realized that we could no longer ssh to our NameNode (NN). Fail of all fails! Sorta. Nothing was really wrong: the NN was up, running, and servicing all requests, and Nagios and Ganglia were both satisfied with its replies. And yet I still had no access. Boo! This was purely a Linux issue.

Being one to avoid "Game Day" failures, I decided to nuke the badly behaving NN. PLEASE NOTE: I had the right opportunity. We had few activities in Prod, I knew what was up with the NN, and I had a high level of confidence that HDFS was working NORMALLY on it. So here is the slightly intoxicated order of events (I celebrated after success):

1. The NN always wrote to 2x local NN dirs plus 1 NFS dir.
2. I made sure all DNs were stopped.
3. Our still-running NN was nuked -- AKA power switch, no mercy, power-button shutdown!
4. I cp'd all the NFS NN info to my replacement NN (not the 2nd NN or SNN), into both local FS dirs.
5. I killed the in_use.lock file, as required.
6. I started NN services on the new NN.
7. I restarted DN services on all DNs.
8. Somewhere along the line I altered the IP addr to ensure that my new PNN was the one everyone was looking for. I specifically did NOT alter the name or IP within any HDFS configs.

This all happened in < 30 mins. It was easy; it was testable. Full credit to the devs who built the PNN and HDFS... even at 1.2.x levels. ;)

We all want federated NN services, but until then, at least there is recovery if you have NFS... and abnormal sanity. ;)
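For anyone wanting the same safety net: the "2x local dirs plus 1 NFS" layout comes from listing multiple directories in dfs.name.dir (the Hadoop 1.x property name) in hdfs-site.xml -- the NN writes its fsimage and edits to every directory listed. The paths below are illustrative, not our actual mounts:

```xml
<!-- hdfs-site.xml (Hadoop 1.x). Paths are examples only. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data1/dfs/name,/data2/dfs/name,/mnt/nfs/dfs/name</value>
</property>
```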
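The copy-and-unlock portion of the steps above can be sketched in shell. Everything here is an assumption -- substitute your real dfs.name.dir entries and NFS mount; the demo scaffolding fakes the NFS metadata so the sketch is runnable, and the service commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch of a manual NN recovery from the NFS metadata copy (Hadoop 1.x era).
# All paths below are illustrative assumptions, not real cluster paths.
set -e

NFS_DIR=/tmp/demo-nfs/dfs/name    # assumed NFS copy of the NN metadata
LOCAL1=/tmp/demo-nn/dfs/name1     # first local dfs.name.dir on the new NN
LOCAL2=/tmp/demo-nn/dfs/name2     # second local dfs.name.dir

# --- demo scaffolding: fake the NFS metadata so this runs anywhere ---
mkdir -p "$NFS_DIR/current"
touch "$NFS_DIR/current/fsimage" "$NFS_DIR/current/edits" "$NFS_DIR/in_use.lock"

# 1. Copy the NFS NN metadata into both local dirs on the replacement NN.
for d in "$LOCAL1" "$LOCAL2"; do
  mkdir -p "$d"
  cp -a "$NFS_DIR/." "$d/"
  # 2. Remove the stale in_use.lock left behind by the dead NameNode.
  rm -f "$d/in_use.lock"
done

# 3. Start the NN, then restart every DN (echoed here, not executed).
echo "hadoop-daemon.sh start namenode"
echo "ssh <dn-host> hadoop-daemon.sh start datanode   # repeat per DN"
```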