[ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250035#comment-17250035
 ] 

Caroline commented on HBASE-25032:
----------------------------------

[~anoop.hbase] Once Master acknowledges the reportForDuty and sends the 
response back to the RS, the RS performs all the actions within [this 
handleReportForDuty() 
method|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1494].
Aside from setting up the WAL and replication, this method also sets up an 
ephemeral znode, initializes the file system, sets up metrics, starts service 
threads, starts the heap memory manager, and sets the RS's internal 'online' 
boolean to true, among other tasks. So the RS does a lot of vital and 
time-consuming setup after reporting for duty to Master.

Therefore, I believe it makes sense for Master to delay placing the RS into its 
onlineServers list until after the RS has completed all of the above tasks. The 
approach taken in the PRs is to leave the RS reportForDuty/handleReportForDuty 
logic as is, and change the Master-side logic so that Master asynchronously 
polls for the RS's internal 'online' boolean to be set to true before placing 
the RS into its onlineServers list (the RS sets this flag at the end of its 
handleReportForDuty method).

The flow looks something like this:

RS starts up -> RS sends reportForDuty to Master -> Master acknowledges 
reportForDuty, sends response to RS; at the same time, Master spawns thread to 
poll for RS 'online' flag (i.e. RS setup complete) -> RS receives 
'reportForDuty received' acknowledgement from Master -> RS finishes setup, sets 
its 'online' flag to true -> Master sees RS has finished setup -> Master adds 
RS to Master's onlineServers list.
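
As a rough illustration of that flow, here is a minimal, self-contained sketch. 
It is not the code from the PRs: FakeRegionServer, FakeMaster, onReportForDuty, 
pollOnce and the poll interval are hypothetical names chosen for this example, 
and a real Master would presumably check the flag over RPC rather than reading 
it in-process.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class OnlineGateSketch {

  /** Hypothetical stand-in for a region server; not the real HRegionServer. */
  static class FakeRegionServer {
    final String name;
    private final AtomicBoolean online = new AtomicBoolean(false);

    FakeRegionServer(String name) { this.name = name; }

    /** Models the end of handleReportForDuty(): setup is done, flag is set. */
    void finishSetup() { online.set(true); }

    boolean isOnline() { return online.get(); }
  }

  /** Hypothetical stand-in for the Master-side bookkeeping. */
  static class FakeMaster {
    final Set<String> onlineServers = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService poller =
        Executors.newSingleThreadScheduledExecutor();

    /** Acknowledge right away, but defer adding the RS until it is online. */
    void onReportForDuty(FakeRegionServer rs, long pollIntervalMs) {
      poller.schedule(() -> pollOnce(rs, pollIntervalMs),
          pollIntervalMs, TimeUnit.MILLISECONDS);
    }

    /** One poll: add the RS if it is online, otherwise schedule another poll. */
    private void pollOnce(FakeRegionServer rs, long pollIntervalMs) {
      if (rs.isOnline()) {
        onlineServers.add(rs.name);
      } else {
        poller.schedule(() -> pollOnce(rs, pollIntervalMs),
            pollIntervalMs, TimeUnit.MILLISECONDS);
      }
    }

    void shutdown() { poller.shutdownNow(); }
  }

  public static void main(String[] args) throws InterruptedException {
    FakeMaster master = new FakeMaster();
    FakeRegionServer rs = new FakeRegionServer("rs1");

    master.onReportForDuty(rs, 10);   // RS reported for duty; setup still running
    Thread.sleep(50);
    System.out.println("before setup completes: "
        + master.onlineServers.contains("rs1"));

    rs.finishSetup();                 // RS sets its 'online' flag
    long deadline = System.currentTimeMillis() + 2000;
    while (!master.onlineServers.contains("rs1")
        && System.currentTimeMillis() < deadline) {
      Thread.sleep(5);
    }
    System.out.println("after setup completes: "
        + master.onlineServers.contains("rs1"));
    master.shutdown();
  }
}
```

The point of the sketch is only that the acknowledgement and the addition to 
onlineServers are decoupled, so region assignment cannot target an RS that is 
still initializing.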

> Wait for region server to become online before adding it to online servers in 
> Master
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-25032
>                 URL: https://issues.apache.org/jira/browse/HBASE-25032
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Guggilam
>            Assignee: Caroline
>            Priority: Major
>
> As part of RS start up, RS reports for duty to Master. Master acknowledges 
> the request and adds it to the onlineServers list for further assigning any 
> regions to the RS.
> Once Master acknowledges the reportForDuty and sends back the response, RS 
> does a bunch of stuff like initializing replication sources etc before 
> becoming online. However, sometimes there could be an issue with initializing 
> replication sources when it is unable to connect to peer clusters because of 
> some kerberos configuration and there would be a delay of around 20 mins in 
> becoming online.
>  
> Since master considers it online, it tries to assign regions, which fails 
> with ServerNotRunningYet exception, then the master tries to unassign which 
> again fails with the same exception leading the region to FAILED_CLOSE state.
>  
> It would be good to have a check to see if the RS is ready to accept the 
> assignment requests before adding it to online servers list which would 
> account for any such delays as described above



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
