[ https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin resolved HBASE-21576.
--------------------------------------
Resolution: Not A Problem
> master should proactively reassign meta when killing a RS with it
> -----------------------------------------------------------------
>
> Key: HBASE-21576
> URL: https://issues.apache.org/jira/browse/HBASE-21576
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> Master killed an RS that was hosting meta, most likely due to some HDFS
> issue (I've lost the RS logs due to HBASE-21575).
> The RS took a very long time to die (again, this might be a separate bug;
> I'll file it if I see a repro) and a long time to restart. Meanwhile, master
> never tried to reassign meta, and eventually killed itself because it was
> not able to update it.
> It seems like an RS on a bad machine would be especially prone to slow
> abort/startup, as well as to issues causing master to kill it, so it would
> make sense for master to immediately relocate meta once the meta-hosting RS
> is dead after a kill, or even when killing the RS. In the former case (if the
> RS needs to die for meta to be reassigned safely), perhaps the RS hosting
> meta in particular should try to die fast in such circumstances and skip
> any cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000]
> master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal
> error:
> ***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL
> required. Forcing server shutdown *****
> .... [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO [PEWorker-7] client.RpcRetryingCallerImpl: Call
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> <server1>,17020,1544264858183 aborting, details=row '...' on table
> 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=<server1>,17020,1544264858183, seqNum=-1
> ... [starting for ~5 minutes]
> 2018-12-08 04:59:58,574 INFO
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000]
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61,
> started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on
> connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException:
> connection timed out: <server1>, details=row '...' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183,
> seqNum=-1
> ... [re-initializing for at least ~7 minutes]
> 2018-12-08 05:04:17,271 INFO [hconnection-0x4d58bcd4-shared-pool3-t1877]
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61,
> started=41137 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
> <server1>,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster:
> ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=...
> state=OPEN *****
> {noformat}
> There are no signs of meta assignment activity at all in the master logs.
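
For readers who want the timeline at a glance, here is a minimal sketch that computes how long after the RS's fatal-error report each quoted log line occurred. The timestamps are copied verbatim from the excerpt above; the event labels and the `offsets` helper are illustrative paraphrases, not HBase names.

```python
from datetime import datetime

# Timestamp layout used in the quoted master log ("," separates milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps are taken from the log excerpt; labels are paraphrases (assumed).
events = [
    ("RS reports fatal error to master", "2018-12-08 04:52:55,144"),
    ("meta call fails: RS aborting", "2018-12-08 04:53:44,190"),
    ("meta call fails: connect timeout", "2018-12-08 04:59:58,574"),
    ("meta call fails: RS not running yet", "2018-12-08 05:04:17,271"),
    ("master aborts", "2018-12-08 05:11:18,470"),
]

def offsets(evts):
    """Return (label, seconds since the first event) for each event."""
    start = datetime.strptime(evts[0][1], FMT)
    return [
        (label, (datetime.strptime(ts, FMT) - start).total_seconds())
        for label, ts in evts
    ]

timeline = offsets(events)
for label, secs in timeline:
    print(f"+{secs:9.3f}s  {label}")
```

By this arithmetic the master abort comes a little over 18 minutes after the RS first reported the fatal error, which is the window in which the description expected to see meta reassignment activity.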