[ https://issues.apache.org/jira/browse/HBASE-25238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226811#comment-17226811 ]
Michael Stack commented on HBASE-25238: --------------------------------------- Here is log from manual migration. I ran a 2.0.x cluster, loaded it w/ some data. I then stopped the 2.0.x Master. Started a 2.4.x Master. Below you see successful migration of store from old format to new. All the while the old RegionServer kept heartbeating though it was using old ClusterReport format. {code:java} 2020-11-05 16:27:04,684 INFO [master/hbasedn020:16000:becomeActiveMaster] region.RegionProcedureStore: Starting Region Procedure Store lease recovery... 2020-11-05 16:27:04,685 INFO [master/hbasedn020:16000:becomeActiveMaster] region.RegionProcedureStore: The old WALProcedureStore wal directory hdfs://nameservice1/tmp/stack.wal/MasterProcWALs exists, migrating... 2020-11-05 16:27:04,694 INFO [master/hbasedn020:16000:becomeActiveMaster] util.RecoverLeaseFSUtils: Recover lease on dfs file hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log 2020-11-05 16:27:04,697 INFO [master/hbasedn020:16000:becomeActiveMaster] util.RecoverLeaseFSUtils: Recovered lease, attempt=0 on file=hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log after 3ms 2020-11-05 16:27:04,704 WARN [master/hbasedn020:16000:becomeActiveMaster] wal.WALProcedureStore: Unable to read tracker for hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat$InvalidWALDataException: Missing trailer: size=18 startPos=18 at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readTrailer(ProcedureWALFormat.java:182) at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTrailer(ProcedureWALFile.java:93) at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTracker(ProcedureWALFile.java:100) at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1389) at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:1338) at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:416) at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.tryMigrate(RegionProcedureStore.java:180) at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.recoverLease(RegionProcedureStore.java:257) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:587) at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1560) at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:925) at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2182) at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:603) at java.base/java.lang.Thread.run(Thread.java:834) 2020-11-05 16:27:04,710 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=21 2020-11-05 16:27:04,714 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.ProcedureWALFormatReader: Rebuilding tracker for hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log 2020-11-05 16:27:04,716 WARN [master/hbasedn020:16000:becomeActiveMaster] wal.ProcedureWALFormatReader: Nothing left to decode. Exiting with missing EOF, log=hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log 2020-11-05 16:27:04,716 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 0 entries in hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log 2020-11-05 16:27:04,737 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=22 2020-11-05 16:27:04,737 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.WALProcedureStore: Remove all state logs with ID less than 21, since no active procedures 2020-11-05 16:27:04,737 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.ProcedureWALFile: Archiving hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000020.log to hdfs://nameservice1/tmp/stack.wal/oldWALs/pv2-00000000000000000020.log 2020-11-05 16:27:04,738 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.ProcedureWALFile: Archiving hdfs://nameservice1/tmp/stack.wal/MasterProcWALs/pv2-00000000000000000021.log to hdfs://nameservice1/tmp/stack.wal/oldWALs/pv2-00000000000000000021.log 2020-11-05 16:27:04,743 INFO [master/hbasedn020:16000:becomeActiveMaster] region.RegionProcedureStore: Migrated 0 existing procedures from the old storage format. 2020-11-05 16:27:04,743 INFO [master/hbasedn020:16000:becomeActiveMaster] region.RegionProcedureStore: The WALProcedureStore max pid is 0, and the max pid of all loaded procedures is -1 2020-11-05 16:27:04,743 INFO [master/hbasedn020:16000:becomeActiveMaster] wal.WALProcedureStore: Stopping the WAL Procedure Store, isAbort=false 2020-11-05 16:27:04,755 INFO [master/hbasedn020:16000:becomeActiveMaster] region.RegionProcedureStore: Migration of WALProcedureStore finished 2020-11-05 16:27:04,755 INFO [master/hbasedn020:16000:becomeActiveMaster] procedure2.ProcedureExecutor: Recovered RegionProcedureStore lease in 70 msec 2020-11-05 16:27:04,782 INFO [master/hbasedn020:16000:becomeActiveMaster] procedure2.ProcedureExecutor: Loaded RegionProcedureStore in 26 msec {code} > Upgrading HBase from 2.2.0 to 2.3.x fails because of “Message missing > required fields: state” > --------------------------------------------------------------------------------------------- > > Key: HBASE-25238 > URL: https://issues.apache.org/jira/browse/HBASE-25238 > Project: HBase > Issue Type: Bug > Affects Versions: 2.2.0 > Reporter: Zhuqi Jin > Priority: Critical > > When we upgraded HBASE cluster from 2.0.0-RC0 to 2.3.0 or 2.3.3, the HMaster > on upgraded node failed to start. > The error message is shown below: > {code:java} > 2020-11-02 23:04:01,998 ERROR [master/2c4006997f99:16000:becomeActiveMaster] > master.HMaster: Failed to become active > masterorg.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: > Message missing required fields: state at > org.apache.hbase.thirdparty.com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:79) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:68) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:120) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:125) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48) > at > org.apache.hbase.thirdparty.com.google.protobuf.Any.unpack(Any.java:228) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil$StateSerializer.deserialize(ProcedureUtil.java:124) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.deserializeStateData(RegionRemoteProcedureBase.java:352) > at > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure.deserializeStateData(OpenRegionProcedure.java:72) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:294) > at > org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43) > at > org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90) > at > org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore$1.load(RegionProcedureStore.java:194) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:474) > at > org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:151) > at > org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:103) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:465) > at > org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.tryMigrate(RegionProcedureStore.java:184) > at > org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.recoverLease(RegionProcedureStore.java:257) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:587) > at > org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1572) > at > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:950) > at > org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2240) > at > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:622) > at java.lang.Thread.run(Thread.java:748)2020-11-02 23:04:01,998 ERROR > [master/2c4006997f99:16000:becomeActiveMaster] master.HMaster: ***** ABORTING > master 2c4006997f99,16000,1604358237412: Unhandled exception. Starting > shutdown. > *****org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: > Message missing required fields: state at > org.apache.hbase.thirdparty.com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:79) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:68) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:120) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:125) > at > org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48) > at > org.apache.hbase.thirdparty.com.google.protobuf.Any.unpack(Any.java:228) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil$StateSerializer.deserialize(ProcedureUtil.java:124) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.deserializeStateData(RegionRemoteProcedureBase.java:352) > at > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure.deserializeStateData(OpenRegionProcedure.java:72) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:294) > at > org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43) > at > org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90) > at > org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore$1.load(RegionProcedureStore.java:194) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:474) > at > org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:151) > at > org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:103) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:465) > at > org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.tryMigrate(RegionProcedureStore.java:184) > at > org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.recoverLease(RegionProcedureStore.java:257) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:587) > at > org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1572) > at > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:950) > at > org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2240) > at > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:622) > at java.lang.Thread.run(Thread.java:748)2020-11-02 23:04:01,999 INFO > [master/2c4006997f99:16000:becomeActiveMaster] regionserver.HRegionServer: > ***** STOPPING region server '2c4006997f99,16000,1604358237412' > *****2020-11-02 23:04:01,999 INFO > [master/2c4006997f99:16000:becomeActiveMaster] regionserver.HRegionServer: > STOPPED: Stopped by master/2c4006997f99:16000:becomeActiveMaster2020-11-02 > 23:04:02,814 INFO [2c4006997f99:16000.splitLogManager..Chore.1] > hbase.ScheduledChore: Chore: SplitLogManager Timeout Monitor was > stopped2020-11-02 23:04:03,652 INFO [master/2c4006997f99:16000] > ipc.NettyRpcServer: Stopping server on /252.17.1.2:160002020-11-02 > 23:04:03,658 INFO [master/2c4006997f99:16000] regionserver.HRegionServer: > Stopping infoServer2020-11-02 23:04:03,871 INFO [master/2c4006997f99:16000] > handler.ContextHandler: Stopped > o.e.j.w.WebAppContext@6136998b{/,null,UNAVAILABLE}{file:/hbase/hbase-webapps/master}2020-11-02 > 23:04:03,877 INFO [master/2c4006997f99:16000] server.AbstractConnector: > Stopped > ServerConnector@60d1b21f{HTTP/1.1,[http/1.1]}{0.0.0.0:16010}2020-11-02 > 23:04:03,878 INFO [master/2c4006997f99:16000] handler.ContextHandler: > Stopped > o.e.j.s.ServletContextHandler@aa004a0{/static,file:///hbase/hbase-webapps/static/,UNAVAILABLE}2020-11-02 > 23:04:03,878 INFO [master/2c4006997f99:16000] handler.ContextHandler: > Stopped > o.e.j.s.ServletContextHandler@5965844d{/logs,file:///hbase/logs/,UNAVAILABLE}2020-11-02 > 23:04:03,888 INFO [master/2c4006997f99:16000] regionserver.HRegionServer: > aborting server 2c4006997f99,16000,16043582374122020-11-02 23:04:03,889 INFO > [master/2c4006997f99:16000] regionserver.HRegionServer: stopping server > 2c4006997f99,16000,1604358237412; all regions closed.2020-11-02 23:04:03,889 > INFO [master/2c4006997f99:16000] hbase.ChoreService: Chore service for: > master/2c4006997f99:16000 had [] on shutdown2020-11-02 23:04:03,890 INFO > [master/2c4006997f99:16000] region.RegionProcedureStore: Stopping the Region > Procedure Store, isAbort=true2020-11-02 23:04:03,894 WARN > [master/2c4006997f99:16000] master.ActiveMasterManager: Failed get of master > address: java.io.IOException: Can't get master address from ZooKeeper; znode > data == null2020-11-02 23:04:03,894 INFO [master/2c4006997f99:16000] > region.MasterRegion: Closing local region {ENCODED => > 1595e783b53d99cd5eef43b6debb2682, NAME => > 'master:store,,1.1595e783b53d99cd5eef43b6debb2682.', STARTKEY => '', ENDKEY > => ''}, isAbort=true2020-11-02 23:04:03,901 INFO [master/2c4006997f99:16000] > regionserver.HRegion: Closing region > master:store,,1.1595e783b53d99cd5eef43b6debb2682.2020-11-02 23:04:03,903 INFO > [master/2c4006997f99:16000] regionserver.HRegion: Closed > master:store,,1.1595e783b53d99cd5eef43b6debb2682.2020-11-02 23:04:03,903 INFO > [master/2c4006997f99:16000] hbase.ChoreService: Chore service for: > 2c4006997f99:16000.splitLogManager. had [] on shutdown2020-11-02 23:04:03,904 > INFO [master:store-WAL-Roller] wal.AbstractWALRoller: LogRoller > exiting.2020-11-02 23:04:03,998 INFO > [ReadOnlyZKClient-252.17.1.5:2181@0x0432f3aa] zookeeper.ZooKeeper: Session: > 0x10138d353e5001b closed2020-11-02 23:04:03,998 INFO > [ReadOnlyZKClient-252.17.1.5:2181@0x0432f3aa-EventThread] > zookeeper.ClientCnxn: EventThread shut down for session: > 0x10138d353e5001b2020-11-02 23:04:04,098 INFO [master/2c4006997f99:16000] > zookeeper.ZooKeeper: Session: 0x10138d353e50018 closed2020-11-02 23:04:04,098 > INFO [master/2c4006997f99:16000] regionserver.HRegionServer: Exiting; > stopping=2c4006997f99,16000,1604358237412; zookeeper connection > closed.2020-11-02 23:04:04,098 INFO [main-EventThread] zookeeper.ClientCnxn: > EventThread shut down for session: 0x10138d353e500182020-11-02 23:04:04,099 > ERROR [main] master.HMasterCommandLine: Master > exitingjava.lang.RuntimeException: HMaster Aborted at > org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:244) > at > org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149) > at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:3072) > {code} > > It can be reproduced through the following steps: > # Start up a cluster of version 2.0.0-RC0 with 3 nodes > # Use hbase pe to write data. > # > {code:java} > /hbase/bin/hbase pe --nomapred --oneCon=true --valueSize=10 --rows=100 > sequentialWrite 1 > {code} > # Stop the cluster: > ## using the graceful_stop.sh to stop all regionservers. > ## Then run stop-hbase.sh > # Upgrade the node to branch-2.3 > # After upgraded, as the log, hbase--master-2c4006997f99.log, suggested, > HMaster failed to start. -- This message was sent by Atlassian Jira (v8.3.4#803005)