joshelser commented on pull request #2113: URL: https://github.com/apache/hbase/pull/2113#issuecomment-662157484
> Correct me if I'm wrong, if an unknown server join on a hbase cluster, since it's not online or dead, assignment manager should not take those online regions as consideration and serving request. or is this not true?

So, I feel like you're asking a different question than what I was concerned about (I was concerned about making sure all "old" RegionServers are actually down before we reassign regions onto new servers). This worries me because we rely on SCPs, which we are acknowledging are gone in this scenario. Do we just have to make an external "requirement" that the system stopping old hardware ensures all previous RegionServers are fully dead before proceeding with creating new ones that point at the same data?

To the question you asked, what is the definition of an "unknown" server in your case: a ServerName listed in meta as a region's assigned location which is not in the AssignmentManager's set of live RegionServers? If that's the case, yes, that's how I understand AM to work today -- the presence of an "unknown" server as an assignment indicates a failure in the system. That is, we lost a MasterProcWAL which had an SCP for a RS. I think that's why this is a "fix by hand" kind of scenario today.

> Basically, DEAD and UNKNOWN_SERVER are different in branch-2 and master branch, which in ServerManager we only track onlineServers and deadServer and I didn't find any transition that a UNKNOWN_SERVER could be moved to DEAD or ONLINE.

Yup, that's the crux of it. This has been a nagging problem in the back of my mind that turns my stomach.

This is what I think the situation is (to try to come up with some common terminology):

1. hbase:meta has assigned regions to a set of RegionServers `rs1`
2. All hosts of `rs1` are shut down and destroyed (but meta still contains references to them)
3. A new set of RegionServers is created, `rs2`, whose hostnames are all distinct from those of `rs1`
4. All MasterProcWALs from the cluster with `rs1` are lost.
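To make the terminology concrete, here is a minimal sketch of the "unknown" server definition discussed above: a server that meta references as a region location but that appears in neither the online set nor the dead set. This is not HBase's actual code. The class `UnknownServerSketch`, its methods, and the plain-string server names are hypothetical stand-ins for illustration, and the real `ServerManager` and `AssignmentManager` track considerably more state.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in HBase.
public class UnknownServerSketch {
    enum State { ONLINE, DEAD, UNKNOWN }

    // Classify one server against the two sets the master is assumed to track.
    static State classify(String server, Set<String> online, Set<String> dead) {
        if (online.contains(server)) {
            return State.ONLINE;
        }
        if (dead.contains(server)) {
            return State.DEAD;
        }
        // Referenced somewhere, but the master has no record of it at all.
        return State.UNKNOWN;
    }

    // Collect every server that meta points at which is neither online nor dead.
    // metaLocations maps a region name to the server meta says is hosting it.
    static Set<String> unknownServers(Map<String, String> metaLocations,
                                      Set<String> online, Set<String> dead) {
        Set<String> unknown = new TreeSet<>();
        for (String server : metaLocations.values()) {
            if (classify(server, online, dead) == State.UNKNOWN) {
                unknown.add(server);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Steps 1-2: meta still references the destroyed rs1 hosts (made-up names).
        Map<String, String> meta = Map.of(
            "region-a", "rs1-host1,16020,1",
            "region-b", "rs1-host2,16020,1");
        // Step 3: only the new rs2 servers have checked in.
        Set<String> online = Set.of("rs2-host1,16020,2");
        // Step 4: the MasterProcWALs are lost, so no rs1 server was ever marked dead.
        Set<String> dead = Set.of();
        System.out.println(unknownServers(meta, online, dead));
    }
}
```

In this sketch every `rs1` server ends up UNKNOWN rather than DEAD, because nothing in the inputs records that those servers ever crashed. The classification alone says nothing about whether such a server is still able to accept writes.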
For HBase's consistency/safety, before we start any RS in `rs2`, we want to make sure all RS in `rs1` are completely "down". That is, no RS in `rs1` can be allowed to accept any more writes. What I'm wondering is, can we do something within HBase (without relying on whoever is controlling that infrastructure) to allow RS in `rs2` to start coming up?

When can we be sure that an UNKNOWN_SERVER is actually dead? By definition, we don't know the state of it.