[ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh updated HBASE-6626:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.0
                   0.99.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

> Add a chapter on HDFS in the troubleshooting section of the HBase reference 
> guide.
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-6626
>                 URL: https://issues.apache.org/jira/browse/HBASE-6626
>             Project: HBase
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.95.2
>            Reporter: Nicolas Liochon
>            Assignee: Misty Stanley-Jones
>            Priority: Blocker
>             Fix For: 0.99.0, 2.0.0
>
>         Attachments: HBASE-6626.patch, troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
> New sub-chapter in the existing chapter "Troubleshooting and Debugging
> HBase": "HDFS & HBase"
> 1) HDFS & HBase
> 2) Settings for retries and timeouts
> 2.1) Retries
> 2.2) Timeouts
> 3) Typical error logs
> 1) HDFS & HBase
> HBase uses HDFS to store its HFiles (the core HBase data files) and its
> Write-Ahead Logs (the files used to restore the data after a crash).
> In both cases, the reliability of HBase comes from the fact that HDFS writes
> the data to multiple locations. To be efficient, HBase needs the data to be
> available locally, so it is highly recommended to run an HDFS datanode on
> the same machines as the HBase RegionServers.
> Detailed information on how HDFS works can be found at [1].
> Important features are:
>  - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient
> class. This class name appears in HBase logs alongside other HDFS
> client-related log lines.
>  - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS
> side, others are HDFS-client-side, i.e. must be set in HBase, and some must
> be set in both places (see the configuration sketch after this list).
>  - the HDFS writes are pipelined from one datanode to another. When writing,
> there are communications between:
>     - HBase and the HDFS namenode, through the HDFS client classes.
>     - HBase and the HDFS datanodes, through the HDFS client classes.
>     - the HDFS datanodes between themselves: issues on these communications
> appear in the HDFS logs, not in the HBase logs. HDFS writes are always local
> when possible. As a consequence, there should not be many write errors in
> the HBase RegionServers: they write to the local datanode. If this datanode
> cannot replicate the blocks, that will appear in its own logs, not in the
> RegionServer logs.
>  - datanodes can be contacted through the ipc.Client interface (once again,
> this class can show up in HBase logs) and through the data transfer
> interface (which usually shows up as the DataNode class in the HBase logs).
> These are on different ports (defaults being 50010 and 50020).
>  - To understand exactly what's going on, you must look at the HDFS log
> files as well: the HBase logs represent only the client side.
>  - With the default settings, HDFS needs 630s to mark a datanode as dead.
> Until HDFS definitively decides the node is dead, it will still be tried by
> HBase or by other datanodes when writing and reading. This will add some
> extra lines in the logs. This monitoring is performed by the NameNode.
>  - The HDFS clients (i.e. HBase using the HDFS client code) don't fully rely
> on the NameNode, but can temporarily mark a node as dead if they hit an
> error when they tried to use it.
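> To make the client-side/server-side distinction concrete, here is a minimal
> configuration sketch (an illustration, not a recommendation; the values are
> the defaults discussed below, assuming the usual millisecond units). A
> setting such as dfs.socket.timeout is read by the HDFS client code running
> inside HBase, so it goes inside the <configuration> element of
> hbase-site.xml, and it is also used for datanode-to-datanode traffic, so it
> goes in hdfs-site.xml as well:
>   <!-- hbase-site.xml (HDFS-client-side: HBase is the HDFS client) -->
>   <property>
>     <name>dfs.socket.timeout</name>
>     <value>60000</value>
>   </property>
>   <!-- hdfs-site.xml (HDFS-server-side: datanode-to-datanode traffic) -->
>   <property>
>     <name>dfs.socket.timeout</name>
>     <value>60000</value>
>   </property>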
> 2) Settings for retries and timeouts
> 2.1) Retries
> ipc.client.connect.max.retries
> Default 10
> Indicates the number of retries a client will make to establish a server
> connection. Not taken into account if the error is a SocketTimeout; in that
> case the number of retries is 45 (fixed on branch by HADOOP-7932, or by
> HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be
> increased, especially if the socket timeouts have been lowered.
> ipc.client.connect.max.retries.on.timeouts
> Default 45
> If you have HADOOP-7932, this is the max number of retries on timeout. The
> counter is distinct from ipc.client.connect.max.retries, so if you mix both
> kinds of socket errors you will get 55 retries with the default values.
> Could be lowered, once it is available. With HADOOP-7397,
> ipc.client.connect.max.retries is reused, so there would be 10 tries.
> dfs.client.block.write.retries
> Default 3
> Number of tries for the client when writing a block. After a failure, the
> client will connect to the namenode and get a new location, sending the list
> of the datanodes already tried without success. Could be increased,
> especially if the socket timeouts have been lowered. See HBASE-6490.
> dfs.client.block.write.locateFollowingBlock.retries
> Default 5
> Number of retries to the namenode when the client gets a
> NotReplicatedYetException, i.e. the existing blocks of the file are not yet
> replicated to dfs.replication.min. This should not impact HBase, as
> dfs.replication.min defaults to 1.
> dfs.client.max.block.acquire.failures
> Default 3
> Number of tries to read a block from the datanode list. In other words, if 5
> datanodes are supposed to hold a block (so dfs.replication equals 5), the
> client will try all these datanodes, then check the value of
> dfs.client.max.block.acquire.failures to see if it should retry or not. If
> so, it will get a new list (likely the same one), and will try to reconnect
> to all these 5 datanodes again. Could be increased, especially if the socket
> timeouts have been lowered.
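> All the retry settings above are HDFS-client-side, so from HBase's point of
> view they would go inside the <configuration> element of hbase-site.xml. A
> minimal sketch using the property names discussed above; the values are
> only illustrative, not recommendations:
>   <property>
>     <name>ipc.client.connect.max.retries</name>
>     <value>20</value>   <!-- default 10 -->
>   </property>
>   <property>
>     <name>dfs.client.block.write.retries</name>
>     <value>6</value>    <!-- default 3, see HBASE-6490 -->
>   </property>
>   <property>
>     <name>dfs.client.max.block.acquire.failures</name>
>     <value>6</value>    <!-- default 3 -->
>   </property>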
>  
> 2.2) Timeouts
> 2.2.1) Heartbeats
> dfs.heartbeat.interval
>  Default is 3s
> heartbeat.recheck.interval
>  Default is 300s
> A datanode is considered dead when there is no heartbeat for (2 *
> heartbeat.recheck.interval + 10 * dfs.heartbeat.interval), that is
> 2*300s + 10*3s = 630s. So until those 10 minutes 30 seconds have elapsed,
> the datanode is still considered fully available by the namenode. After this
> delay, HDFS is likely to start replicating the blocks contained on the dead
> node to get back to the right number of replicas. As a consequence, if we're
> too aggressive we will have a side effect here, adding workload to an
> already damaged cluster. For this reason it's not recommended to change
> these settings.
> As the datanodes communicate between themselves and share these settings,
> these settings are both HDFS-client-side and HDFS-server-side.
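> For reference, a sketch of these two settings at their defaults; as said
> above, changing them is not recommended, and since they are both
> HDFS-client-side and HDFS-server-side they would have to match in
> hbase-site.xml and hdfs-site.xml. The units are an assumption based on the
> usual Hadoop defaults (dfs.heartbeat.interval in seconds,
> heartbeat.recheck.interval in milliseconds):
>   <property>
>     <name>dfs.heartbeat.interval</name>
>     <value>3</value>         <!-- seconds -->
>   </property>
>   <property>
>     <name>heartbeat.recheck.interval</name>
>     <value>300000</value>    <!-- milliseconds, i.e. 300s -->
>   </property>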
> 2.2.2) Socket timeouts
> Three timeouts are considered in HDFS:
>  - connect timeout: the timeout when we try to establish the connection
>  - read timeout: the timeout when we read something on an already
> established connection
>  - write timeout: the timeout when we try to write something on an already
> established connection.
> They are managed by two settings:
> dfs.socket.timeout
> Default is 60s
> dfs.datanode.socket.write.timeout
> Default is 480s
> But these settings are used:
> - between the DFSClient and the datanodes
> - between the ipc.Client and the datanodes
> - between the datanodes themselves
> - sometimes, but not always, with an extension (depending on the number of
> replicas)
> - for dfs.socket.timeout, both as a socket connect timeout and as a socket
> read timeout
> - for dfs.datanode.socket.write.timeout, when it is set to 0, a plain old
> java socket is created in some cases instead of an NIO one.
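> If these timeouts are to be lowered, they would have to be changed on both
> sides, since they also cover datanode-to-datanode traffic. A minimal sketch,
> to be applied identically to hbase-site.xml and hdfs-site.xml; the values
> are arbitrary examples in milliseconds, not recommendations:
>   <property>
>     <name>dfs.socket.timeout</name>
>     <value>20000</value>     <!-- default 60000, i.e. 60s -->
>   </property>
>   <property>
>     <name>dfs.datanode.socket.write.timeout</name>
>     <value>30000</value>     <!-- default 480000, i.e. 480s -->
>   </property>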
> final calculated connect timeout can be:
>  hard-coded to 20s for the ipc.Client in Hadoop 1.0.3 (changed in
> HADOOP-7397)
>  dfs.socket.timeout  (ex: DataNode#DataTransfer, DataXceiver#replaceBlock)
>  dfs.socket.timeout + 3s*#replica  (ex: DataXceiver#write,
> DFSClient#getFileChecksum called from FileCheckSumServlet)
> final read timeouts can be:
>  dfs.socket.timeout  (DataXceiver#replaceBlock, ipc.Client from DFSClient)
>  dfs.socket.timeout +  3s*#replica  (ex: DataNode#DataTransfer, 
> DataXceiver#write)
>  dfs.socket.timeout * #replica (ex: DataNode#DataTransfer)
>  
> final calculated write timeouts can be:
>  dfs.datanode.socket.write.timeout (ex: DataXceiver#copyBlock/readBlock/...)
>  dfs.datanode.socket.write.timeout + 5s*#replica (ex:
> DFSClient#createBlockOutputStream, DataXceiver#writeBlock)
>  dfs.datanode.socket.write.timeout + 5s*(#replica - 1) (ex:
> DataNode#DataTransfer. See HADOOP-5464).
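> As a worked example with the defaults above, assuming dfs.replication = 3
> (matching the "Usually the list size is 3" note in the log samples below):
> a connect timeout of dfs.socket.timeout + 3s*#replica gives 60s + 3s*3 =
> 69s (the 69000 ms value mentioned just below), and a write timeout of
> dfs.datanode.socket.write.timeout + 5s*#replica gives 480s + 5s*3 = 495s.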
> Hence we will often see a 69000 ms timeout in the logs before the datanode
> is marked dead/excluded. Also, setting "dfs.socket.timeout" to 0 does not
> make it wait forever, but likely 9 seconds (0 + 3s*3) instead of 69s for
> data transfer.
> 3) Typical error logs
> 3.1) Typical logs when all datanodes for a block are dead, making the HBase
> recovery impossible. The HBase master logs will contain, with a 0.90 HBase:
> INFO hdfs.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and
> continue java.net.SocketTimeoutException: 60000 millis timeout while waiting
> for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=/region-server-1:50010]
> => The client tries to connect to a dead datanode.
> => It failed, so the client will try the next datanode in the list. Usually
> the list size is 3 (dfs.replication).
> => If the final list is empty, it means that all the datanodes proposed by
> the namenode are in our dead nodes list.
> => The HDFS client clears the dead nodes list and sleeps 3 seconds
> (hard-coded), swallowing the InterruptedException, and asks the namenode
> again. This is the log line:
> INFO hdfs.DFSClient: Could not obtain block blk_xxx from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> => All the locations initially given by the namenode to this client are
> actually dead. The client asks for a new set of locations.
> => We're very likely to get exactly the same datanode list as 3 seconds ago,
> unless a datanode came back to life or a replication has just finished.
> => After dfs.client.max.block.acquire.failures (default: 3) failures, an
> exception is thrown and logged, and we have in the logs:
> WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block:
> blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => There is another retry, hard-coded to 2, but this is logged only once,
> even if the second try fails.
> => Moreover, for the second try the error counters are not reinitialized,
> including the dead nodes list, so this second attempt is unlikely to succeed.
> It should come back with an empty node list again, and throw a new
> java.io.IOException: Could not obtain block: blk_xxx
> file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => This exception will go to the final client (HBase). HBase will log it,
> and we will see
> INFO wal.HLogSplitter: Got while parsing hlog
> hdfs://namenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy. Marking as
> corrupted java.io.IOException: Could not obtain block: blk_xxx
> file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
> 3.2) Typical log for write issues: the master reads the log, then wants to
> split it, hence writing a block:
> INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/xxx:50010]
> => We tried to connect to the dead datanode to write. Likely from the master
> (it does not have a datanode, so it connects to a remote datanode).
> => A region server will not have this type of error, as it connects to a
> local datanode to write.
> => It failed at the very beginning: we could not connect at all (i.e. it did
> not fail during the write itself).
> INFO hdfs.DFSClient: Abandoning block blk_xxx
> => HBase (as an HDFS client) told the namenode that the block was not
> written.
> INFO hdfs.DFSClient: Excluding datanode xxx:50010
> => Internally, the HDFS client stream puts the datanode in its excludedNodes
> list (hence the "Excluding datanode" log line).
> => The HDFS client goes back to the namenode asking for another proposed set
> of datanodes, sending the excluded-datanode list to be sure it's not trying
> the same nodes again.
> => There will be 3 retries by default (setting:
> "dfs.client.block.write.retries"). If you've lost 20% of your cluster,
> roughly 1% of the time all 3 attempts will fail. If that is the case (i.e.
> all attempts failed), the next log line is:
> WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to
> create new block. And then, if it was for a log split:
> FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log
> (various possible stacks here)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
