[ https://issues.apache.org/jira/browse/HBASE-22041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110623#comment-17110623 ]
Andrey Elenskiy commented on HBASE-22041:
-----------------------------------------

Just reproduced this again. The ServerCrashProcedure for the regionserver the master is trying to reconnect to is stuck with state=WAITING:SERVER_CRASH_FINISH. That ServerCrashProcedure is waiting on a TransitRegionStateProcedure with state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, which in turn is waiting on an OpenRegionProcedure with state=RUNNABLE. The regions in transition are stuck in OPENING state for a regionserver that is alive.

If I'm reading the logs correctly, the master is trying to connect to the old IP address of the restarted regionserver. In Kubernetes, when a pod is restarted it gets a new IP address but keeps its hostname (if it is part of a StatefulSet). So somewhere HBase either assumes that a server's IP address never changes, or it caches the result of the hostname resolution. In this particular case the master appears to be assigning regions to the correct online regionserver, but it still uses the old IP address.

> The crashed node exists in onlineServer forever, and if it holds the meta
> data, master will start up hang.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22041
>                 URL: https://issues.apache.org/jira/browse/HBASE-22041
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Priority: Critical
>        Attachments: bug.zip, normal.zip
>
>
> While the master is freshly booting, we crash (kill -9) the RS that holds meta. We find
> that the master startup fails and prints thousands of log lines like:
> {code:java}
> 2019-03-13 01:09:54,896 WARN [RSProcedureDispatcher-pool4-t1] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to java.net.ConnectException: Call to hadoop14/172.16.1.131:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..)
> failed: Connection refused: hadoop14/172.16.1.131:16020, try=0, retrying...
> 2019-03-13 01:09:55,004 WARN [RSProcedureDispatcher-pool4-t2] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=1, retrying...
> 2019-03-13 01:09:55,114 WARN [RSProcedureDispatcher-pool4-t3] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=2, retrying...
> 2019-03-13 01:09:55,219 WARN [RSProcedureDispatcher-pool4-t4] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=3, retrying...
> 2019-03-13 01:09:55,324 WARN [RSProcedureDispatcher-pool4-t5] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=4, retrying...
> 2019-03-13 01:09:55,428 WARN [RSProcedureDispatcher-pool4-t6] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=5, retrying...
> 2019-03-13 01:09:55,533 WARN [RSProcedureDispatcher-pool4-t7] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=6, retrying...
> 2019-03-13 01:09:55,638 WARN [RSProcedureDispatcher-pool4-t8] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=7, retrying...
> 2019-03-13 01:09:55,755 WARN [RSProcedureDispatcher-pool4-t9] procedure.RSProcedureDispatcher: request to server hadoop14,16020,1552410583724 failed due to org.apache.hadoop.hbase.ipc.FailedServerException: Call to hadoop14/172.16.1.131:16020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: hadoop14/172.16.1.131:16020, try=8, retrying...
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
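The failure mode the comment describes — holding on to a hostname's old IP after the pod behind it has been rescheduled — can be illustrated with the JDK's `InetSocketAddress`. Its two-argument constructor resolves the hostname once, at construction time, and keeps that address forever, whereas `createUnresolved()` defers resolution so a caller can re-resolve on every connection attempt. This is only a minimal sketch of eager vs. deferred resolution, not HBase's actual RSProcedureDispatcher code; the hostname and port are placeholders:

```java
import java.net.InetSocketAddress;

public class ReResolveSketch {
    public static void main(String[] args) {
        // Eagerly resolved: the IP is captured now. If the host behind
        // this name later gets a new address (e.g. a restarted pod in a
        // StatefulSet), this object still points at the stale IP.
        InetSocketAddress cached = new InetSocketAddress("localhost", 16020);
        System.out.println("cached  isUnresolved = " + cached.isUnresolved());

        // Deferred: no lookup happens here. A connector that builds a
        // fresh, resolved address per retry would pick up the new IP.
        InetSocketAddress deferred =
                InetSocketAddress.createUnresolved("localhost", 16020);
        System.out.println("deferred isUnresolved = " + deferred.isUnresolved());
    }
}
```

Running this prints `cached  isUnresolved = false` and `deferred isUnresolved = true`: the constructor form has already pinned an IP, while the unresolved form keeps only the hostname until someone resolves it.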