[ https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250035#comment-17250035 ]
Caroline commented on HBASE-25032:
----------------------------------

[~anoop.hbase] Once Master acknowledges the reportForDuty and sends the response back to the RS, the RS performs all the actions within [this handleReportForDuty() method|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1494]. Aside from setting up the WAL and replication, this method also sets up an ephemeral znode, initializes the file system, sets up metrics, starts service threads, starts the heap memory manager, and sets the RS's internal 'online' boolean to true, among other tasks.

So the RS does a lot of vital and time-consuming setup after reporting for duty to Master. I therefore believe it makes sense for Master to delay placing the RS into its onlineServers list until the RS has completed all of the above tasks.

The approach taken in the PRs is to leave the RS reportForDuty/handleReportForDuty logic as is and change the Master-side logic, so that Master asynchronously polls for the RS's internal 'online' boolean to be set to true before placing the RS into its onlineServers list (the flag is set at the end of the RS's handleReportForDuty method).

The flow looks something like this:

RS starts up
-> RS sends reportForDuty to Master
-> Master acknowledges reportForDuty and sends the response to RS; at the same time, Master spawns a thread to poll for the RS 'online' flag (i.e. RS setup complete)
-> RS receives the 'reportForDuty received' acknowledgement from Master
-> RS finishes setup and sets its 'online' flag to true
-> Master sees that RS has finished setup
-> Master adds RS to Master's onlineServers list.
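For illustration only, here is a minimal toy sketch of that flow in plain Java. It is not the code from the PRs and uses no HBase classes: ToyMaster, ToyRegionServer, pollOnce, awaitOnline and the poll interval are all hypothetical names and values, and the real 'online' check would go over RPC rather than reading a field directly.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Toy model only: none of these classes exist in HBase.
final class OnlinePollSketch {

  /** Hypothetical stand-in for a region server; models only the 'online' flag. */
  static final class ToyRegionServer {
    final String name;
    private final AtomicBoolean online = new AtomicBoolean(false);

    ToyRegionServer(String name) { this.name = name; }

    /** Stands in for the end of the RS-side setup that follows the acknowledgement. */
    void finishSetup() { online.set(true); }

    boolean isOnline() { return online.get(); }
  }

  /** Hypothetical stand-in for the Master; models only the onlineServers list. */
  static final class ToyMaster {
    final Set<String> onlineServers = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService poller =
        Executors.newSingleThreadScheduledExecutor();

    /** Acknowledge right away (by returning), but defer the onlineServers update. */
    void reportForDuty(ToyRegionServer rs, long pollMillis) {
      poller.schedule(() -> pollOnce(rs, pollMillis), 0, TimeUnit.MILLISECONDS);
    }

    /** Add the RS once its flag is set; otherwise check again after pollMillis. */
    private void pollOnce(ToyRegionServer rs, long pollMillis) {
      if (rs.isOnline()) {
        onlineServers.add(rs.name);
      } else {
        poller.schedule(() -> pollOnce(rs, pollMillis), pollMillis, TimeUnit.MILLISECONDS);
      }
    }

    /** Demo helper: wait until the server is listed or the timeout elapses. */
    boolean awaitOnline(String name, long timeoutMillis) {
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
      while (!onlineServers.contains(name)) {
        if (System.nanoTime() - deadline >= 0) {
          return false;
        }
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(1));
      }
      return true;
    }

    void shutdown() { poller.shutdownNow(); }
  }

  public static void main(String[] args) {
    ToyMaster master = new ToyMaster();
    ToyRegionServer rs = new ToyRegionServer("rs1");
    master.reportForDuty(rs, 10);
    System.out.println("listed before setup: " + master.onlineServers.contains("rs1"));
    rs.finishSetup();
    System.out.println("listed after setup: " + master.awaitOnline("rs1", 5000));
    master.shutdown();
  }
}
```

The sketch deliberately leaves out things a real change would have to handle, such as giving up on an RS that never comes online or one that dies while Master is still polling.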
> Wait for region server to become online before adding it to online servers in Master
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-25032
>                 URL: https://issues.apache.org/jira/browse/HBASE-25032
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Guggilam
>            Assignee: Caroline
>            Priority: Major
>
> As part of RS start up, RS reports for duty to Master. Master acknowledges
> the request and adds it to the onlineServers list for further assigning any
> regions to the RS
> Once Master acknowledges the reportForDuty and sends back the response, RS
> does a bunch of stuff like initializing replication sources etc before
> becoming online. However, sometimes there could be an issue with initializing
> replication sources when it is unable to connect to peer clusters because of
> some kerberos configuration and there would be a delay of around 20 mins in
> becoming online.
>
> Since master considers it online, it tries to assign regions and which fails
> with ServerNotRunningYet exception, then the master tries to unassign which
> again fails with the same exception leading the region to FAILED_CLOSE state.
>
> It would be good to have a check to see if the RS is ready to accept the
> assignment requests before adding it to online servers list which would
> account for any such delays as described above

--
This message was sent by Atlassian Jira
(v8.3.4#803005)