[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119827#comment-17119827 ]
Michael Stack commented on HBASE-22041:
---------------------------------------

bq. I was able to reproduce this issue with ttl=1 as well as ttl=0 (which I guess means no caching).

Ouch. Any evidence that the Java process ever notices the DNS update?

Next, I think, would be looking at throwing away the Connection after N retries. We keep trying forever, waiting on an SCP. If we can create a new Connection and keep going without disrupting any other ongoing RPCs, that sounds like the way to go here.

Thanks [~timoha]

> [k8s] The crashed node exists in onlineServer forever, and if it holds the meta data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>         Attachments: bug.zip, hbasemaster.log, normal.zip
>
>
> During a fresh master boot, we crash (kill -9) the RS that holds meta. We find
> that the master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to hadoop14/172.16.1.131:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
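
The suggestion in the comment, throwing away the Connection after N retries and carrying on with a new one, could look roughly like the sketch below. This is only an illustration under stated assumptions, not HBase code: `RpcLink` and `ReconnectingCaller` are hypothetical names invented here, and the factory is assumed to be the point where a real client would re-resolve the server's hostname. Each call owns its own link, so discarding one does not touch RPCs in flight elsewhere.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an RPC connection; NOT an HBase type. */
interface RpcLink extends AutoCloseable {
    String call(String request) throws Exception;

    @Override
    void close(); // narrowed: closing a link never throws in this sketch
}

/** Retries a call, discarding the link after a fixed number of failures. */
final class ReconnectingCaller {
    private final Supplier<RpcLink> factory; // assumed to build a fresh link
    private final int retriesPerLink;        // the "N" from the comment
    private final int maxLinks;              // upper bound so we never loop forever

    ReconnectingCaller(Supplier<RpcLink> factory, int retriesPerLink, int maxLinks) {
        if (retriesPerLink < 1 || maxLinks < 1) {
            throw new IllegalArgumentException("limits must be >= 1");
        }
        this.factory = factory;
        this.retriesPerLink = retriesPerLink;
        this.maxLinks = maxLinks;
    }

    String call(String request) throws Exception {
        Exception last = null;
        for (int link = 0; link < maxLinks; link++) {
            // try-with-resources closes the link whether the call succeeds or not.
            try (RpcLink current = factory.get()) {
                for (int attempt = 0; attempt < retriesPerLink; attempt++) {
                    try {
                        return current.call(request);
                    } catch (Exception e) {
                        last = e; // remember the failure, retry on the same link
                    }
                }
            }
            // N retries failed on this link: fall through and build a new one.
        }
        throw new Exception("gave up after " + maxLinks + " links", last);
    }
}
```

Bounding the number of links is a choice made for the sketch; whether the master should ever give up, or instead hand off to the SCP, is a separate question this example does not answer.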