[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-6626:
-------------------------

    Attachment: troubleshooting.txt

Started converting to docbook. Nicolas, you are missing the [1] link below. What did you intend to point to?

> Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-6626
>                 URL: https://issues.apache.org/jira/browse/HBASE-6626
>             Project: HBase
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Priority: Minor
>         Attachments: troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
>
> New sub-chapter in the existing chapter "Troubleshooting and Debugging HBase": "HDFS & HBase"
>
> 1) HDFS & HBase
> 2) Settings for retries and timeouts
> 2.1) Retries
> 2.2) Timeouts
> 3) Log samples
>
> 1) HDFS & HBase
>
> HBase uses HDFS to store its HFiles, i.e. the core HBase data files, and its Write-Ahead-Logs, i.e. the files that will be used to restore the data after a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes the data to multiple locations. To be efficient, HBase needs the data to be available locally, hence it is highly recommended to run an HDFS datanode on the same machines as the HBase Region Servers.
> Detailed information on how HDFS works can be found at [1].
>
> Important features are:
> - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient class. This class can appear in HBase logs alongside other HDFS client related logs.
> - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS side, some are HDFS-client-side, i.e. must be set in HBase, and some must be set in both places (a small configuration sketch follows this list).
> - HDFS writes are pipelined from one datanode to another. When writing, there are communications between:
>   - HBase and the HDFS namenode, through the HDFS client classes;
>   - HBase and the HDFS datanodes, through the HDFS client classes;
>   - the HDFS datanodes between themselves: issues in these communications appear in the HDFS logs, not in the HBase logs. HDFS writes are always local when possible. As a consequence, there should not be many write errors in the HBase Region Servers: they write to the local datanode. If this datanode cannot replicate the blocks, that will appear in its logs, not in the region server logs.
> - Datanodes can be contacted through the ipc.Client interface (once again, this class can show up in HBase logs) and through the data transfer interface (which usually shows up as the DataNode class in the HBase logs). These are on different ports (defaults being 50010 and 50020).
> - To understand exactly what's going on, you must look at the HDFS log files as well: the HBase logs represent only the client side.
> - With the default settings, HDFS needs 630s to mark a datanode as dead. For this reason, this node will still be tried by HBase or by other datanodes when writing and reading, until HDFS definitively decides it is dead. This will add some extra lines in the logs. This monitoring is performed by the NameNode.
> - The HDFS clients (i.e. HBase using the HDFS client code) do not fully rely on the NameNode, but can temporarily mark a node as dead if they got an error when they tried to use it.
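> As a minimal illustration of the client-side vs. server-side distinction (the class name and the value below are only an example, they are not part of the attached draft): the DFSClient embedded in HBase reads its settings from the HBase configuration, so an HDFS-client-side property has to be visible there, typically via hbase-site.xml, to have any effect.
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>
>   public class ClientSideHdfsSettings {
>     public static void main(String[] args) {
>       // Loads hbase-default.xml and hbase-site.xml from the classpath;
>       // this is the configuration the DFSClient inside HBase will see.
>       Configuration conf = HBaseConfiguration.create();
>       // An HDFS-client-side property (illustrative value): it must be set here or in hbase-site.xml.
>       conf.setInt("ipc.client.connect.max.retries", 20);
>       // A property set only in the datanodes' hdfs-site.xml would not show up in this configuration.
>       System.out.println("ipc.client.connect.max.retries = "
>           + conf.getInt("ipc.client.connect.max.retries", 10));
>     }
>   }
>
> Server-side properties, on the other hand, are read by the namenode and datanodes from their own hdfs-site.xml, and some of them (see 2.2.1) have to be kept consistent on both sides.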
> 2) Settings for retries and timeouts
>
> 2.1) Retries
>
> ipc.client.connect.max.retries
> Default: 10
> Number of retries a client will make to establish a server connection. Not taken into account if the error is a SocketTimeout: in that case the number of retries is 45 (fixed on branch by HADOOP-7932 or in HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be increased, especially if the socket timeouts have been lowered.
>
> ipc.client.connect.max.retries.on.timeouts
> Default: 45
> If you have HADOOP-7932, the maximum number of retries on timeout. The counter is separate from ipc.client.connect.max.retries, so if you mix socket errors and timeouts you will get 55 retries with the default values. Could be lowered, once it is available. With HADOOP-7397, ipc.client.connect.max.retries is reused, so there would be 10 tries.
>
> dfs.client.block.write.retries
> Default: 3
> Number of tries for the client when writing a block. After a failure, the client will connect to the namenode and get a new location, sending the list of the datanodes already tried without success. Could be increased, especially if the socket timeouts have been lowered. See HBASE-6490.
>
> dfs.client.block.write.locateFollowingBlock.retries
> Default: 5
> Number of retries to the namenode when the client gets a NotReplicatedYetException, i.e. the existing blocks of the file are not yet replicated to dfs.replication.min. This should not impact HBase, as dfs.replication.min defaults to 1.
>
> dfs.client.max.block.acquire.failures
> Default: 3
> Number of tries to read a block from the datanode list. In other words, if 5 datanodes are supposed to hold a block (so dfs.replication equals 5), the client will try all these datanodes, then check the value of dfs.client.max.block.acquire.failures to see whether it should retry. If so, it will get a new list (likely the same), and will try to reconnect to all these 5 datanodes again. Could be increased, especially if the socket timeouts have been lowered (a combined tuning sketch follows the socket timeout settings in 2.2.2).
>
> 2.2) Timeouts
>
> 2.2.1) Heartbeats
>
> dfs.heartbeat.interval
> Default: 3s
> heartbeat.recheck.interval
> Default: 300s
> A datanode is considered dead when there has been no heartbeat for (2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval) seconds, i.e. 630s with the defaults. So for 10 minutes and 30 seconds the datanode is still considered fully available by the namenode. After this delay, HDFS is likely to start replicating the blocks contained on the dead node to get back to the right number of replicas. As a consequence, being too aggressive here has a side effect: it adds workload to an already damaged cluster. For this reason it is not recommended to change these settings.
> As there are communications between the datanodes, and as they share these settings, these settings are both HDFS-client-side and HDFS-server-side.
>
> 2.2.2) Socket timeouts
>
> Three timeouts are considered in HDFS:
> - connect timeout: the timeout when we try to establish a connection;
> - read timeout: the timeout when we read something on an already established connection;
> - write timeout: the timeout when we try to write something on an already established connection.
>
> They are managed by two settings:
> dfs.socket.timeout
> Default: 60s
> dfs.datanode.socket.write.timeout
> Default: 480s
>
> But these settings are used:
> - between the DFSClient and the datanodes;
> - between the ipc.Client and the datanodes;
> - between the datanodes themselves;
> - sometimes, but not always, with an extension (depending on the number of replicas);
> - for dfs.socket.timeout, as a socket connect timeout but also as a socket read timeout;
> - for dfs.datanode.socket.write.timeout, when it is set to 0, a plain old java socket is created in some cases instead of an NIO one.
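> The retry and timeout settings work as a pair: lowering the socket timeouts shortens each attempt, so the retry counts from 2.1 usually need to go up to keep the same overall resilience. Below is a minimal sketch of such a combined change on the HBase (client) side; the class name and all values are purely illustrative, not recommendations, and since dfs.socket.timeout and dfs.datanode.socket.write.timeout are also used between the datanodes, the same values would have to go into the datanodes' hdfs-site.xml as well.
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>
>   public class LoweredTimeoutsSketch {
>     public static void main(String[] args) {
>       Configuration conf = HBaseConfiguration.create();
>       // Lower the two socket timeouts (milliseconds); the defaults are 60000 and 480000.
>       conf.setInt("dfs.socket.timeout", 10000);
>       conf.setInt("dfs.datanode.socket.write.timeout", 10000);
>       // Compensate with more retries, as suggested in 2.1.
>       conf.setInt("dfs.client.block.write.retries", 10);
>       conf.setInt("dfs.client.max.block.acquire.failures", 10);
>       System.out.println("dfs.socket.timeout = " + conf.getInt("dfs.socket.timeout", 60000) + " ms");
>     }
>   }
>
> In a real deployment these values would live in hbase-site.xml and hdfs-site.xml rather than in code; the snippet only shows which properties belong together.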
> Final calculated connect timeouts can be:
> - hard-coded to 20s for the ipc.Client in Hadoop 1.0.3 (changed in HADOOP-7397);
> - dfs.socket.timeout (ex: DataNode#DataTransfer, DataXceiver#replaceBlock);
> - dfs.socket.timeout + 3s * #replica (ex: DataXceiver#writeBlock, DFSClient#getFileChecksum called from FileChecksumServlet).
>
> Final calculated read timeouts can be:
> - dfs.socket.timeout (ex: DataXceiver#replaceBlock, ipc.Client from DFSClient);
> - dfs.socket.timeout + 3s * #replica (ex: DataNode#DataTransfer, DataXceiver#writeBlock);
> - dfs.socket.timeout * #replica (ex: DataNode#DataTransfer).
>
> Final calculated write timeouts can be:
> - dfs.datanode.socket.write.timeout (ex: DataXceiver#copyBlock/readBlock/...);
> - dfs.datanode.socket.write.timeout + 5s * #replica (ex: DFSClient#createBlockOutputStream, DataXceiver#writeBlock);
> - dfs.datanode.socket.write.timeout + 5s * (#replica - 1) (ex: DataNode#DataTransfer, see HADOOP-5464).
>
> Hence we will often see a 69000 ms timeout in the logs before the datanode is marked dead/excluded. Also, setting "dfs.socket.timeout" to 0 does not make the client wait forever, but likely means 9 seconds instead of 69 seconds for data transfer.
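> To make the arithmetic concrete, here is a small sketch (the class name is made up; the values are the defaults and the 3s/5s per-replica extensions are the ones described above) showing where the 69000 ms figure seen in the logs comes from:
>
>   public class HdfsTimeoutMath {
>     public static void main(String[] args) {
>       int dfsSocketTimeout = 60_000;   // dfs.socket.timeout default, in ms
>       int dfsWriteTimeout = 480_000;   // dfs.datanode.socket.write.timeout default, in ms
>       int replicas = 3;                // dfs.replication default
>
>       // Read/connect timeout with the 3s-per-replica extension: 60000 + 3 * 3000 = 69000 ms.
>       int extendedReadTimeout = dfsSocketTimeout + 3_000 * replicas;
>       // Write timeout with the 5s-per-replica extension: 480000 + 3 * 5000 = 495000 ms.
>       int extendedWriteTimeout = dfsWriteTimeout + 5_000 * replicas;
>
>       System.out.println("extended read/connect timeout: " + extendedReadTimeout + " ms");
>       System.out.println("extended write timeout: " + extendedWriteTimeout + " ms");
>     }
>   }
>
> This is why a 69000 millis SocketTimeoutException shows up so often in the log samples below, and why with dfs.socket.timeout set to 0 only the 3s-per-replica extension remains (the "9 seconds" mentioned above).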
> 3) Typical error logs
>
> 3.1) Typical logs when all datanodes for a block are dead, making the HBase recovery impossible. The HBase master logs will contain, with a 0.90 HBase:
>
> INFO hdfs.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and continue java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/region-server-1:50010]
>
> => The client tries to connect to a dead datanode.
> => It failed, so the client will try the next datanode in the list. Usually the list size is 3 (dfs.replication).
> => If the final list is empty, it means that all the datanodes proposed by the namenode are already in our dead nodes list.
> => The HDFS client clears the dead nodes list and sleeps 3 seconds (hard-coded), swallowing InterruptedException, and asks the namenode again. This is the log line:
>
> INFO hdfs.DFSClient: Could not obtain block blk_xxx from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
>
> => All the locations initially given by the namenode to this client are actually dead. The client asks for a new set of locations.
> => We are very likely to get exactly the same datanode list as 3 seconds ago, unless a datanode came back to life or a replication has just finished.
> => After dfs.client.max.block.acquire.failures attempts (default: 3), an exception is thrown, then logged, and we have in the logs:
>
> WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
>
> => There is another retry, hard-coded to 2, but it is logged only once, even if the second try fails.
> => Moreover, for the second try the error counters are not reinitialized, including the dead nodes list, so this second attempt is unlikely to succeed. It should come back again with an empty node list, and throw a new java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => This exception will go to the final client (HBase). HBase will log it, and we will see:
>
> INFO wal.HLogSplitter: Got while parsing hlog hdfs://namenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy. Marking as corrupted java.io.IOException: Could not obtain block: blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
>
> 3.2) Typical logs for write issues: the master reads the log, then wants to split it, hence writes a block:
>
> INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/xxx:50010]
>
> => We tried to connect to the dead datanode to write. Likely from the master: it does not host a datanode, so it connects to a remote one.
> => A region server will not have this type of error, as it connects to a local datanode to write.
> => It failed at the very beginning: we could not connect at all (i.e. the failure is not during the write itself).
>
> INFO hdfs.DFSClient: Abandoning block blk_xxx
>
> => HBase (as an HDFS client) told the namenode that the block has not been written.
>
> INFO hdfs.DFSClient: Excluding datanode xxx:50010
>
> => Internally, the HDFS client stream puts the datanode in its excludedNodes list (the "Excluding datanode" log line).
> => The HDFS client goes back to the namenode asking for another datanode set proposal, sending the excluded datanode list to make sure it does not try the same nodes again.
> => There will be 3 retries by default (setting: "dfs.client.block.write.retries"). If you have lost 20% of your cluster, roughly 1% of the time all 3 attempts will fail (see the back-of-the-envelope sketch at the end of this section). If that is the case (i.e. all attempts failed), the next log line is:
>
> WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>
> And then, if it was for a split log:
>
> FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log
> (various possible stacks here)
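> As a rough sanity check of the "20% of the cluster" remark above, here is a back-of-the-envelope sketch (the class name is made up, and it treats the three attempts as independent draws, which ignores the excluded-nodes list):
>
>   public class WriteRetryOdds {
>     public static void main(String[] args) {
>       double deadFraction = 0.20;  // 20% of the datanodes are gone
>       int attempts = 3;            // dfs.client.block.write.retries default
>       double allAttemptsFail = Math.pow(deadFraction, attempts);
>       // 0.2^3 = 0.008, i.e. a bit under 1% of the block writes would exhaust all attempts.
>       System.out.printf("P(all %d attempts hit a dead datanode) = %.3f%n", attempts, allAttemptsFail);
>     }
>   }
>
> The real arithmetic is a bit different, since the namenode is asked to avoid the excluded datanodes on each retry, but the order of magnitude matches the 1% quoted above, and raising dfs.client.block.write.retries pushes it down further.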