joshelser commented on pull request #2113:
URL: https://github.com/apache/hbase/pull/2113#issuecomment-662157484


   > Correct me if I'm wrong, if an unknown server join on a hbase cluster, 
since it's not online or dead, assignment manager should not take those online 
regions as consideration and serving request. or is this not true?
   
   So, I feel like you're asking a different question than what I was concerned 
about (I was concerned about making sure all "old" RegionServers are actually 
down before we reassign regions onto new servers). This worries me because we 
rely on SCPs (ServerCrashProcedures), which we are acknowledging are gone in 
this scenario. Do we just have to make it an external "requirement" that the 
system stopping the old hardware ensures all previous RegionServers are fully 
dead before proceeding with creating new ones that point at the same data?
   
   To the question you asked: what is the definition of an "unknown" server in 
your case? Is it a ServerName listed in meta as a region's assigned location 
that is not in the AssignmentManager's set of live RegionServers? If that's the 
case, yes, that's how I understand AM to work today -- the presence of an 
"unknown" server as an assignment indicates a failure in the system. That is, 
we lost a MasterProcWAL which had an SCP for an RS. I think that's why this is 
a "fix by hand" kind of scenario today.
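
   To make that definition concrete, here is a minimal sketch of the 
classification as I understand it. This is not actual HBase code: the class, 
the `classify` helper, and the plain-string server names are all hypothetical, 
and the two sets only stand in for the online and dead server tracking in 
ServerManager.

```java
import java.util.Set;

public class ServerStateSketch {
    // Illustrative states only; these are not HBase's own enum.
    enum State { ONLINE, DEAD, UNKNOWN }

    // Hypothetical helper: classify a server name found in hbase:meta
    // against the two sets of servers the master is assumed to track.
    static State classify(String serverName, Set<String> online, Set<String> dead) {
        if (online.contains(serverName)) {
            return State.ONLINE;
        }
        if (dead.contains(serverName)) {
            return State.DEAD;
        }
        // Referenced by meta, but neither online nor known dead:
        // nothing here tells us whether the process is still running.
        return State.UNKNOWN;
    }

    public static void main(String[] args) {
        // Made-up server names in host,port,startcode form.
        Set<String> online = Set.of("rs-new-1,16020,1595400000000");
        Set<String> dead = Set.of("rs-old-1,16020,1590000000000");
        String inMeta = "rs-old-2,16020,1590000000000";
        System.out.println(classify(inMeta, online, dead));
    }
}
```

   Note that nothing in this sketch ever moves a name out of the UNKNOWN 
branch; it stays there until something external adds it to one of the two sets.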
   
   > Basically, DEAD and UNKNOWN_SERVER are different in branch-2 and master 
branch, which in ServerManager we only track onlineServers and deadServer and I 
didn't find any transition that a UNKNOWN_SERVER could be moved to DEAD or 
ONLINE.
   
   Yup, that's the crux of it.
   
   This has been a nagging problem in the back of my mind that turns my 
stomach. This is what I think the situation is (to try to come up with some 
common terminology):
   
   1. hbase:meta has regions assigned to a set of RegionServers, `rs1`.
   2. All hosts of `rs1` are shut down and destroyed (but meta still contains 
references to them).
   3. A new set of RegionServers, `rs2`, is created, with hostnames that are 
completely different from those in `rs1`.
   4. All MasterProcWALs from the cluster with `rs1` are lost.
   
   For HBase's consistency/safety, before we start any RS in `rs2`, we want to 
make sure all RS in `rs1` are completely "down". That is, no RS in `rs1` can be 
allowed to accept any more writes. What I'm wondering is, can we do something 
within HBase (without relying on whoever is controlling that infrastructure) 
to allow RS in `rs2` to start coming up? When can we be sure that an 
UNKNOWN_SERVER is actually dead? By definition, we don't know the state of it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org