[ https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin resolved HBASE-21576.
--------------------------------------
    Resolution: Not A Problem

> master should proactively reassign meta when killing a RS with it
> -----------------------------------------------------------------
>
>                 Key: HBASE-21576
>                 URL: https://issues.apache.org/jira/browse/HBASE-21576
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Master has killed an RS that was hosting meta due to some HDFS issue (most
> likely; I've lost the RS logs due to HBASE-21575).
> RS took a very long time to die (again, might be a separate bug, I'll file if
> I see repro), and a long time to restart; meanwhile master never tried to
> reassign meta, and eventually killed itself not being able to update it.
> It seems like a RS on a bad machine would be especially prone to slow
> abort/startup, as well as to issues causing master to kill it, so it would
> make sense for master to immediately relocate meta once meta-hosting RS is
> dead after a kill; or even when killing the RS. In the former case (if the RS
> needs to die for meta to be reassigned safely), perhaps the RS hosting meta
> in particular should try to die fast in such circumstances, and not do any
> cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000]
> master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal
> error:
> ***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL
> required. Forcing server shutdown *****
> .... [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO [PEWorker-7] client.RpcRetryingCallerImpl: Call
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> <server1>,17020,1544264858183 aborting, details=row '...' on table
> 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=<server1>,17020,1544264858183, seqNum=-1
> ... [starting for ~5]
> 2018-12-08 04:59:58,574 INFO
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000]
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61,
> started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on
> connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException:
> connection timed out: <server1>, details=row '...' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183,
> seqNum=-1
> ... [re-initializing for at least ~7]
> 2018-12-08 05:04:17,271 INFO [hconnection-0x4d58bcd4-shared-pool3-t1877]
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61,
> started=41137 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
> <server1>,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster:
> ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=...
> state=OPEN *****
> {noformat}
> There are no signs of meta assignment activity at all in master logs

--
This message was sent by Atlassian JIRA (v7.6.3#76005)