[ 
https://issues.apache.org/jira/browse/HBASE-19768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433369#comment-16433369
 ] 

chenxu commented on HBASE-19768:
--------------------------------

The reason is as follows. In our test environment, if I kill a DN, the RS on 
the same host will roll the WAL.
When the new WAL is created, if the connection to the local DN fails, an 
IOException is thrown.
In the catch block the overwrite variable is set to true, and recoverFileLease 
is executed in the finally block.
But the lease recovery fails:
The RS logs show: util.FSHDFSUtils: Failed to recover lease, attempt=0...
The NN logs show: File ... has not been closed. Lease recovery is in 
progress
Requests to the RS are blocked for a while. If the lease recovery is bypassed, 
requests are not blocked.
Hope you can follow.
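
To make the control flow described above easier to follow, here is a minimal, self-contained sketch. It is not the actual HBase code: the class WalCreateSketch, the Creator interface, the event strings, and the choice to run recoverFileLease only when the attempt did not succeed are all assumptions made for illustration. It only shows the order of steps: the create fails with an IOException, overwrite is set to true in the catch block, and lease recovery runs in the finally block before the next attempt.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WalCreateSketch {

    /** Hypothetical stand-in for "create the WAL file and connect to the DNs". */
    public interface Creator {
        void create(String path, boolean overwrite) throws IOException;
    }

    /**
     * Simulates the retry loop and returns the ordered list of steps taken.
     * The real lease recovery talks to the NN and can take a while; here it
     * is only recorded as an event.
     */
    public static List<String> createWithRetry(Creator creator, String path, int maxRetries) {
        List<String> events = new ArrayList<>();
        boolean overwrite = false;
        for (int retry = 0; retry <= maxRetries; retry++) {
            boolean succ = false;
            try {
                creator.create(path, overwrite);
                succ = true;
                events.add("created overwrite=" + overwrite);
                return events;
            } catch (IOException e) {
                // e.g. connecting to the local (dead) DN failed
                events.add("failed overwrite=" + overwrite);
                // the next attempt overwrites the broken file
                overwrite = true;
            } finally {
                if (!succ) {
                    // assumption: recovery is attempted only after a failed create
                    events.add("recoverFileLease");
                }
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Hypothetical creator: the first attempt fails, the overwrite attempt works.
        Creator flaky = (path, overwrite) -> {
            if (!overwrite) {
                throw new IOException("connection to local DN refused");
            }
        };
        for (String step : createWithRetry(flaky, "/hypothetical/wal/path", 3)) {
            System.out.println(step);
        }
    }
}
```

In this sketch the recoverFileLease step sits between the failed create and the retry, which is the point where, per the logs above, the RS waits while the NN reports that lease recovery is in progress.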

> RegionServer startup failing when DN is dead
> --------------------------------------------
>
>                 Key: HBASE-19768
>                 URL: https://issues.apache.org/jira/browse/HBASE-19768
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Marc Spaggiari
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 2.0.0-beta-2, 2.0.0
>
>         Attachments: HBASE-19768.patch
>
>
> When starting HBase, if the datanode hosted on the same host is dead but not 
> yet detected by the namenode, HBase will fail to start:
> {code}
> 515691223393/node8.distparser.com%2C16020%2C1515691223393.1515691238778 
> failed, retry = 7
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connexion refusée: /192.168.23.2:50010
>       at 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Socket.finishConnect(..)(Unknown
>  Source)
> Caused by: 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeConnectException:
>  syscall:getsockopt(..) failed: Connexion refusée
>       ... 1 more
> {code}
> It will also get stuck when stopping:
> {code}
> hbase@node2:~/hbase-2.0.0-beta-1$ bin/stop-hbase.sh 
> stopping 
> hbase....................................................................................................................................................................................................^C
> hbase@node2:~/hbase-2.0.0-beta-1$ bin/stop-hbase.sh 
> stopping 
> hbase..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/hbase/hbase-2.0.0-beta-1/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/hbase/hbase-2.0.0-beta-1/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> {code}
> Most interestingly, it seems to fail the same way even if the DN is 
> declared dead on the HDFS side:
> {code}
> 515692041367/node8.distparser.com%2C16020%2C1515692041367.1515692057716 
> failed, retry = 4
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connexion refusée: /192.168.23.2:50010
>       at 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Socket.finishConnect(..)(Unknown
>  Source)
> Caused by: 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeConnectException:
>  syscall:getsockopt(..) failed: Connexion refusée
>       ... 1 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
