[ 
https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HBASE-21576.
--------------------------------------
    Resolution: Not A Problem

> master should proactively reassign meta when killing a RS with it
> -----------------------------------------------------------------
>
>                 Key: HBASE-21576
>                 URL: https://issues.apache.org/jira/browse/HBASE-21576
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Master has killed an RS that was hosting meta due to some HDFS issue (most 
> likely; I've lost the RS logs due to HBASE-21575).
> RS took a very long time to die (again, might be a separate bug, I'll file if 
> I see repro), and a long time to restart; meanwhile master never tried to 
> reassign meta, and eventually killed itself not being able to update it.
> It seems like a RS on a bad machine would be especially prone to slow 
> abort/startup, as well as to issues causing master to kill it, so it would 
> make sense for master to immediately relocate meta once meta-hosting RS is 
> dead after a kill; or even when killing the RS. In the former case (if the RS 
> needs to die for meta to be reassigned safely), perhaps the RS hosting meta 
> in particular should try to die fast in such circumstances, and not do any 
> cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] 
> master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal 
> error:
> ***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL 
> required. Forcing server shutdown *****
> .... [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call 
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server 
> <server1>,17020,1544264858183 aborting, details=row '...' on table 
> 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=<server1>,17020,1544264858183, seqNum=-1
> ... [starting for ~5]
> 2018-12-08 04:59:58,574 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] 
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, 
> started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on 
> connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: 
> connection timed out: <server1>, details=row '...' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, 
> seqNum=-1
> ... [re-initializing for at least ~7]
> 2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] 
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, 
> started=41137 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server 
> <server1>,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: 
> ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... 
> state=OPEN *****^M
> {noformat}
> There are no signs of meta assignment activity at all in master logs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to