taklwu edited a comment on pull request #2113:
URL: https://github.com/apache/hbase/pull/2113#issuecomment-662590771


   Thanks Josh, and honestly I didn't know the logic until now. Here are my findings for both situations you're concerned about:
   
   #### first case
       1. hbase:meta has assigned regions to a set of RegionServers rs1
       2. All hosts of rs1 are shut down and destroyed (i.e. meta still contains references to them)
       3. A new set of RegionServers, rs2, is created, whose hostnames are completely different from those in rs1
       4. All MasterProcWALs from the cluster with rs1 are lost.
   
   #### second case
       1. I have a healthy cluster (1 master, many RS)
       2. I stop the master
       3. I kill one RS
          3a. I do not restart that RS
       4. I restart the master
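   
   As a side note, the second case above can be reproduced roughly like this with a mini cluster. This is a sketch only; it assumes the `HBaseTestingUtility`/`MiniHBaseCluster` test APIs and is not taken from this PR:
   
   ```java
   import org.apache.hadoop.hbase.HBaseTestingUtility;
   import org.apache.hadoop.hbase.MiniHBaseCluster;
   import org.apache.hadoop.hbase.ServerName;
   
   public class SecondCaseRepro {
     public static void main(String[] args) throws Exception {
       HBaseTestingUtility util = new HBaseTestingUtility();
       util.startMiniCluster(1, 3);            // 1 master, 3 region servers
       MiniHBaseCluster cluster = util.getMiniHBaseCluster();
   
       cluster.stopMaster(0);                  // step 2: stop the master
       cluster.waitOnMaster(0);
   
       ServerName victim = cluster.getRegionServer(0).getServerName();
       cluster.killRegionServer(victim);       // step 3: kill one RS, do not restart it
   
       cluster.startMaster();                  // step 4: restart the master
     }
   }
   ```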
   
   There are three key parts of the normal system that handle a deleted region server: MasterProcWALs/MasterRegion, which track `DEAD` servers via SCPs; the RegionServer names encoded in the WAL directories, which identify `possibly live` servers; and the znodes in ZooKeeper, which identify online servers.
   
   If MasterProcWALs/MasterRegion both survive a cluster restart, then when `RegionServerTracker` starts it figures out all online servers; any `possibly live` server without a znode (one with the same hostname after the restart, presumably) is marked dead and gets an SCP scheduled for it, while SCPs for already-dead servers simply continue. That is the normal case, for example:
   
   
   ```
   2020-07-22 09:55:24,729 INFO  [master/localhost:0:becomeActiveMaster] 
master.RegionServerTracker(123): Starting RegionServerTracker; 0 have existing 
ServerCrashProcedures, 3 possibly 'live' servers, and 0 'splitting'.
   2020-07-22 09:55:24,730 DEBUG [master/localhost:0:becomeActiveMaster] 
zookeeper.RecoverableZooKeeper(183): Node 
/hbase/draining/localhost,55572,1595436917066 already deleted, retry=false
   2020-07-22 09:55:24,730 INFO  [master/localhost:0:becomeActiveMaster] 
master.ServerManager(585): Processing expiration of 
localhost,55572,1595436917066 on localhost,55667,1595436924374
   2020-07-22 09:55:24,755 DEBUG [master/localhost:0:becomeActiveMaster] 
procedure2.ProcedureExecutor(1050): Stored pid=12, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
localhost,55572,1595436917066, splitWal=true, meta=true
   ```
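   
   In pseudo-Java, that reconciliation looks roughly like the following. This is a simplified sketch of my understanding, not the real `RegionServerTracker` code; the three helper methods are placeholders for the actual sources (MasterProcWALs/MasterRegion, the WAL directories, and the ZK rs znodes):
   
   ```java
   // Simplified sketch of the startup reconciliation (not the real code).
   Set<ServerName> withExistingScp = serversWithScpFromMasterProcWals();  // placeholder
   Set<ServerName> possiblyLive    = serverNamesFromWalDirectories();     // placeholder
   Set<ServerName> online          = serverNamesFromZkRsZnodes();         // placeholder
   
   for (ServerName sn : possiblyLive) {
     // No znode under the same name means the server never checked back in
     // after the restart, so treat it as dead and schedule an SCP for it.
     if (!online.contains(sn) && !withExistingScp.contains(sn)) {
       serverManager.expireServer(sn);  // logs "Processing expiration of ..."
     }
   }
   // SCPs restored from MasterProcWALs/MasterRegion simply resume.
   ```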
   
   Then, in the case where MasterProcWALs (or MasterRegion on branch-2.3+) is deleted but the ZK nodes are kept, even though there are no procedures to restore from MasterProcWALs, as long as we still have the WAL directory for a previous host we can still schedule an SCP for it. But if both MasterProcWALs and the WALs are deleted, neither the first nor the second case operates normally.
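   
   That works because each WAL directory name encodes the owning server as `host,port,startcode`, which `ServerName.valueOf` can parse back. A rough sketch (the directory-listing helper is a placeholder):
   
   ```java
   // Sketch: recover server names from the WAL directories and expire them.
   // listWalDirectoryNames() is a placeholder for walking the WALs dir on HDFS.
   for (String dirName : listWalDirectoryNames()) {
     // e.g. "localhost,55572,1595436917066"
     ServerName sn = ServerName.valueOf(dirName);
     if (!serverManager.isServerOnline(sn)) {
       serverManager.expireServer(sn);  // schedules an SCP for the dead server
     }
   }
   ```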
   
   The case we were originally trying to solve falls into exactly that situation: MasterProcWALs and the WALs are deleted after the cluster restarts, so we have no WALs, no MasterProcWALs/MasterRegion, and no ZooKeeper data, only the HFiles. Those servers then end up in an unknown state and their regions cannot be reassigned.
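   
   To make that stuck state concrete: `hbase:meta` still carries `info:server` cells pointing at hosts nothing else knows about. A quick way to list them with the plain client API (a sketch, assuming an existing `Configuration conf`):
   
   ```java
   // Sketch: print the host:port each region's info:server cell points at.
   // Any entry whose host no longer exists is an "unknown" server whose
   // regions cannot be reassigned without an SCP.
   try (Connection conn = ConnectionFactory.createConnection(conf);
        Table meta = conn.getTable(TableName.META_TABLE_NAME);
        ResultScanner scanner = meta.getScanner(
            new Scan().addColumn(Bytes.toBytes("info"), Bytes.toBytes("server")))) {
     for (Result r : scanner) {
       byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
       if (server != null) {
         System.out.println(Bytes.toString(r.getRow()) + " -> " + Bytes.toString(server));
       }
     }
   }
   ```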
   
   ----
   
   About the unit test failures: now I'm hitting a strange issue. My tests work fine on branch-2.2 if I delete the WALs, MasterProcWALs, and the ZK baseZNode. However, with the same setup on branch-2.3+ and master, master initialization hangs if the ZK baseZNode is deleted, with or without my changes. (What changed in branch-2.3? I found MasterRegion, but I'm not sure why that would be related to the ZK data; is it a bug?)
   
   Interestingly, my fix works if I keep the baseZNode, so I'm trying to figure out the right way to clean up ZooKeeper so that it matches the cloud use case where both the WALs on HDFS and the ZK data are deleted when the HBase cluster is terminated.
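   
   For the cleanup itself, I'm doing something like the following between restarts (a sketch; assuming `ZKWatcher`/`ZKUtil` from hbase-zookeeper and an existing `Configuration conf`):
   
   ```java
   // Sketch: wipe the base znode between restarts to mimic the cloud case
   // where ZK data does not survive cluster termination.
   try (ZKWatcher zkw = new ZKWatcher(conf, "test-cleanup", null)) {
     String baseZNode = zkw.getZNodePaths().baseZNode;  // usually "/hbase"
     ZKUtil.deleteNodeRecursively(zkw, baseZNode);
   }
   ```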
   
   

