[ https://issues.apache.org/jira/browse/HBASE-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283349#comment-13283349 ]
ramkrishna.s.vasudevan commented on HBASE-5916:
-----------------------------------------------
@Chunhui
I like your idea too. As I said, we are planning to raise an improvement
activity for master restart and SSH.
Even with the above approach, there is one more problematic scenario. Please
note that this scenario can occur even without your suggestion.
There are two region servers. Both go down while the flow is in
AM.joinCluster(). As no RS is available at that time, we will not make any
assignment, and all regions will go into RIT waiting for the timeout monitor.
SSH is also waiting because the master initialization is not complete (this
step is as per your suggestion). Now suppose there are 100 regions, all
waiting to be assigned.
Now if a new RS comes up, there is this code in TimeoutMonitor:
{code}
if (regionState.getStamp() + timeout <= now) {
  // decide on action upon timeout
  actOnTimeOut(regionState);
} else if (this.allRegionServersOffline && !allRSsOffline) {
  // if some RSs just came back online, we can start
  // the assignment right away
  actOnTimeOut(regionState);
}
{code}
It will immediately trigger assignment. At the same time, since master
initialization has completed, SSH is also able to carry out assignment.
This will lead to double assignment. In fact, in HBASE-5816 Stack was
suggesting to have one common queue through which every assignment is done,
so that SSH will not interfere with it or vice versa.
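To illustrate the common-queue idea, here is a minimal sketch, not HBase
code. The class SingleAssignmentQueue and its methods are hypothetical names
invented for this example. It assumes that both TimeoutMonitor and SSH submit
region names to one shared queue, which drops any request for a region that
is already queued or already handed out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a single assignment queue; not part of HBase. */
public class SingleAssignmentQueue {
    // Regions waiting to be assigned; a set, so each region is queued once.
    private final Set<String> pending = new LinkedHashSet<>();
    // Regions already handed out for assignment by an earlier drain().
    private final Set<String> handedOut = new HashSet<>();

    /** Returns true if the request was accepted, false if it is a duplicate. */
    public synchronized boolean requestAssign(String regionName) {
        if (handedOut.contains(regionName)) {
            return false;
        }
        return pending.add(regionName);
    }

    /** Hands out all pending regions; each region is returned only once. */
    public synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(pending);
        handedOut.addAll(batch);
        pending.clear();
        return batch;
    }

    public static void main(String[] args) {
        SingleAssignmentQueue queue = new SingleAssignmentQueue();
        // First caller (say, the timeout monitor) queues the region.
        System.out.println(queue.requestAssign("region-1")); // true
        // Second caller (say, SSH) asks for the same region and is rejected.
        System.out.println(queue.requestAssign("region-1")); // false
        System.out.println(queue.drain()); // [region-1]
    }
}
```

A real implementation would also need a way to release a region for
re-assignment when its server dies again; that is left out here.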
I suggest we get in the patch that addresses the current JIRA problem and
work on a different JIRA that will help me address the master restart and SSH
area, which is troublesome.
> RS restart just before master intialization we make the cluster non operative
> -----------------------------------------------------------------------------
>
> Key: HBASE-5916
> URL: https://issues.apache.org/jira/browse/HBASE-5916
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.92.1, 0.94.0
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Priority: Critical
> Fix For: 0.94.1
>
> Attachments: HBASE-5916_trunk.patch, HBASE-5916_trunk_1.patch,
> HBASE-5916_trunk_1.patch, HBASE-5916_trunk_2.patch, HBASE-5916_trunk_3.patch,
> HBASE-5916_trunk_4.patch, HBASE-5916_trunk_v5.patch,
> HBASE-5916_trunk_v6.patch, HBASE-5916_trunk_v7.patch, HBASE-5916v8.patch
>
>
> Consider a case where my master is getting restarted. An RS that was alive
> when the master restart started gets restarted before the master initializes
> the ServerShutDownHandler.
> {code}
> serverShutdownHandlerEnabled = true;
> {code}
> In this case, when the RS tries to register with the master, the master will
> try to expire the server, but the server cannot be expired as the
> serverShutdownHandler is still not enabled.
> This case may happen when only one RS gets restarted or when all the RSs
> get restarted at the same time (before assignRootandMeta).
> {code}
> LOG.info(message);
> if (existingServer.getStartcode() < serverName.getStartcode()) {
>   LOG.info("Triggering server recovery; existingServer " +
>     existingServer + " looks stale, new server:" + serverName);
>   expireServer(existingServer);
> }
> {code}
> If another RS is brought up, then the cluster comes back to normal.
> This may be a very rare corner case.