Shortly before the 'All datanodes are bad' error, I saw:

2016-03-09 11:57:20,126 INFO [LeaseRenewer:hbase@aws-hbase-staging]
retry.RetryInvocationHandler: Exception while invoking renewLease of class
ClientNamenodeProtocolTranslatorPB over staging-aws-hbase-2.aws.px/10.231.16.197:8020.
Trying to fail over immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category WRITE is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1774)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)
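For reference, the renewLease call above landed on a standby namenode, so the
client had to fail over. A minimal write probe run with the same client
configuration can show whether HDFS is accepting writes through the active
namenode at a given moment. This is only a sketch: the class name and probe
path are placeholders, and it assumes the HA client configuration
(core-site.xml / hdfs-site.xml) is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write and delete a tiny file to check that the client-side HA
    // configuration currently resolves to a namenode that accepts WRITE
    // operations. Class name and path are placeholders, not from this thread.
    public class HdfsWriteProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path probe = new Path("/tmp/hdfs-write-probe");
        try (FSDataOutputStream out = fs.create(probe, true)) {
          out.writeBytes("probe\n");                // fails if only a standby namenode answers
        }
        System.out.println("write ok, length = " + fs.getFileStatus(probe).getLen());
        fs.delete(probe, false);
        fs.close();
      }
    }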
Looks like your hdfs was transitioning. FYI

On Wed, Mar 9, 2016 at 7:25 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> From the region server log, did you notice this:
>
> 2016-03-09 11:59:49,159 WARN
> [regionserver/staging-aws-hbase-data-1.aws.px/10.231.16.30:16020]
> wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: All datanodes
> DatanodeInfoWithStorage[10.231.16.30:50010,DS-a49fd123-fefc-46e2-83ca-6ae462081702,DISK]
> are bad. Aborting...
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
>
> Was hdfs stable around that time?
>
> I am more interested in the failure of the fresh 1.2.0 installation.
>
> Please pastebin the server logs for that incident.
>
> On Wed, Mar 9, 2016 at 6:19 AM, Michal Medvecky <medve...@pexe.so> wrote:
>
>> You can check both logs at
>>
>> http://michal.medvecky.net/log-master.txt
>> http://michal.medvecky.net/log-rs.txt
>>
>> The first restart after the upgrade happened at 10:29.
>>
>> I did not find anything useful.
>>
>> Michal
>>
>> On Wed, Mar 9, 2016 at 2:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > Can you take a look at the data-1 server log around this time frame
>> > to see what happened?
>> >
>> > Thanks
>> >
>> > > On Mar 9, 2016, at 3:44 AM, Michal Medvecky <medve...@pexe.so> wrote:
>> > >
>> > > Hello,
>> > >
>> > > I upgraded my single HBase master and single HBase regionserver from
>> > > 1.1.3 to 1.2.0 by simply stopping both, upgrading the packages
>> > > (I download binary packages from hbase.org), and starting them again.
>> > >
>> > > It did not come up; the master is stuck assigning hbase:meta:
>> > >
>> > > 2016-03-09 11:28:11,491 INFO
>> > > [staging-aws-hbase-3:16000.activeMasterManager] master.AssignmentManager:
>> > > Processing 1588230740 in state: M_ZK_REGION_OFFLINE
>> > > 2016-03-09 11:28:11,491 INFO
>> > > [staging-aws-hbase-3:16000.activeMasterManager] master.RegionStates:
>> > > Transition {1588230740 state=OFFLINE, ts=1457522891475, server=null} to
>> > > {1588230740 state=OFFLINE, ts=1457522891491,
>> > > server=staging-aws-hbase-data-1.aws.px,16020,1457522034036}
>> > > 2016-03-09 11:28:11,492 INFO
>> > > [staging-aws-hbase-3:16000.activeMasterManager] zookeeper.MetaTableLocator:
>> > > Setting hbase:meta region location in ZooKeeper as
>> > > staging-aws-hbase-data-1.aws.px,16020,1457522034036
>> > > 2016-03-09 11:42:10,611 ERROR [Thread-65] master.HMaster: Master failed
>> > > to complete initialization after 900000ms. Please consider submitting a
>> > > bug report including a thread dump of this process.
>> > >
>> > > Did I miss something?
>> > >
>> > > I tried downgrading back to 1.1.3 but later realized this is not
>> > > supported (and of course does not work).
>> > >
>> > > Thank you
>> > >
>> > > Michal
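For the meta-assignment timeout quoted above, a quick way to tell whether the
master ever finished initialization is to try reading hbase:meta with the
plain 1.x client API. The following is only a sketch; the class name is a
placeholder, and it assumes the cluster's hbase-site.xml is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    // Sketch: if the master completed initialization, hbase:meta is assigned
    // and this scan returns one row per user region. If meta never got
    // assigned (as in the logs above), the scan stalls and eventually fails
    // once the client retries are exhausted. Class name is a placeholder.
    public class MetaProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table meta = conn.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan())) {
          int rows = 0;
          for (Result r : scanner) {
            rows++;
          }
          System.out.println("hbase:meta readable, rows = " + rows);
        }
      }
    }

The hbase shell equivalent is simply scan 'hbase:meta'; when meta is
unassigned it likewise stalls instead of returning rows.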