[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated HBASE-6626:
----------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0, 0.99.0
    Hadoop Flags: Reviewed
    Status: Resolved (was: Patch Available)

Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
----------------------------------------------------------------------------------
    Key: HBASE-6626
    URL: https://issues.apache.org/jira/browse/HBASE-6626
    Project: HBase
    Issue Type: Improvement
    Components: documentation
    Affects Versions: 0.95.2
    Reporter: Nicolas Liochon
    Assignee: Misty Stanley-Jones
    Priority: Blocker
    Fix For: 0.99.0, 2.0.0
    Attachments: HBASE-6626.patch, troubleshooting.txt

I looked mainly at the major failure case, but here is what I have:

New sub-chapter in the existing chapter "Troubleshooting and Debugging HBase": HDFS & HBase

1) HDFS & HBase
2) Connection-related settings
2.1) Number of retries
2.2) Timeouts
3) Log samples

1) HDFS & HBase

HBase uses HDFS to store its HFiles (the core HBase data files) and its Write-Ahead-Logs (the files used to restore the data after a crash). In both cases, the reliability of HBase comes from the fact that HDFS writes the data to multiple locations. To be efficient, HBase needs the data to be available locally, so it is highly recommended to run an HDFS datanode on the same machines as the HBase Region Servers. Detailed information on how HDFS works can be found at [1]. The important points are:

- HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient class. This class can appear in HBase logs alongside other HDFS-client-related log lines.
- Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS side; others are HDFS-client-side, i.e. must be set in HBase; and some must be set in both places.
- HDFS writes are pipelined from one datanode to another. When writing, there are communications between:
  - HBase and the HDFS namenode, through the HDFS client classes.
  - HBase and the HDFS datanodes, through the HDFS client classes.
  - The HDFS datanodes themselves. Issues in these communications show up in the HDFS logs, not in the HBase logs. HDFS writes are always local when possible, so there should not be many write errors in the HBase Region Servers: they write to the local datanode. If that datanode cannot replicate the blocks, the errors appear in its logs, not in the region server logs.
- Datanodes can be contacted through the ipc.Client interface (once again, this class can show up in HBase logs) and through the data transfer interface (which usually shows up as the DataNode class in the HBase logs). These are on different ports (the defaults being 50010 and 50020).
- To understand exactly what is going on, you must look at the HDFS log files as well: the HBase logs represent only the client side.
- With the default settings, HDFS needs 630s to mark a datanode as dead (with classic defaults this is 2 x the 300s namenode heartbeat recheck interval plus 10 x the 3s heartbeat interval). Until then, the node will still be tried by HBase and by other datanodes when writing and reading, until HDFS definitively decides it is dead, which adds some extra lines to the logs. This monitoring is performed by the NameNode.
- The HDFS clients (i.e. HBase using the HDFS client code) do not rely solely on the NameNode: they can temporarily mark a node as dead themselves if they got an error when they tried to use it.

2) Settings for retries and timeouts

2.1) Retries

ipc.client.connect.max.retries
Default: 10
The number of retries a client will make to establish a server connection. Not taken into account if the error is a SocketTimeout; in that case the number of retries is 45 (fixed on branch, in HADOOP-7932 or HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be increased, especially if the socket timeouts have been lowered.

ipc.client.connect.max.retries.on.timeouts
Default: 45
If you have HADOOP-7932, the maximum number of retries on a timeout. Its counter is separate from ipc.client.connect.max.retries, so if you mix socket errors and timeouts you can get up to 10 + 45 = 55 retries with the default values. Could be lowered, once it is available. With HADOOP-7397, ipc.client.connect.max.retries is reused instead, so there would be 10 tries.

dfs.client.block.write.retries
Default: 3
The number of tries the client makes when writing a block. After a failure, it reconnects to the namenode to get a new location, sending the list of the datanodes already tried without success. Could be increased, especially if the socket timeouts have been lowered. See HBASE-6490.

dfs.client.block.write.locateFollowingBlock.retries
Default: 5
The number of retries to the namenode when the client gets a NotReplicatedYetException, i.e. the existing blocks of the file are not yet replicated to dfs.replication.min. This should not impact HBase, as dfs.replication.min defaults to 1.
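For illustration, here is a minimal sketch of how the client-side retry settings above could be set. Since HBase is the HDFS client, these belong in hbase-site.xml on the HBase side. The values shown are just the defaults restated, not tuning recommendations:

  <!-- hbase-site.xml: HBase acts as the HDFS client, so HDFS/IPC
       client-side settings belong here. Example values only. -->
  <configuration>
    <property>
      <name>ipc.client.connect.max.retries</name>
      <value>10</value> <!-- retries on connect errors other than SocketTimeout -->
    </property>
    <property>
      <name>ipc.client.connect.max.retries.on.timeouts</name>
      <value>45</value> <!-- retries on SocketTimeout, if HADOOP-7932 is in -->
    </property>
    <property>
      <name>dfs.client.block.write.retries</name>
      <value>3</value> <!-- block-write attempts before asking the namenode for a new pipeline -->
    </property>
  </configuration>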
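Similarly, a sketch of the server-side knobs behind the 630s dead-datanode delay mentioned above. These belong in hdfs-site.xml on the HDFS side; the property names below are the Hadoop 2.x ones (older releases used heartbeat.recheck.interval), so verify them against your Hadoop version:

  <!-- hdfs-site.xml (HDFS server side): the namenode declares a datanode
       dead after 2 * recheck-interval + 10 * heartbeat-interval,
       i.e. 2 * 300s + 10 * 3s = 630s with these defaults. -->
  <configuration>
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value> <!-- seconds between datanode heartbeats -->
    </property>
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>300000</value> <!-- milliseconds (5 minutes) -->
    </property>
  </configuration>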
[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Misty Stanley-Jones updated HBASE-6626:
---------------------------------------
    Attachment: HBASE-6626.patch

I made an attempt. I did not integrate the info in the comments, but I did check the initial content and updated the Hadoop parameters and defaults where needed. I left a couple of the parameters out because they didn't seem to exist anymore or were marked as 'expert' in the HDFS config docs. I would consider 'expert' parameters for HDFS to be out of scope and possibly dangerous for HBase to recommend tweaking. WDYT?
[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Misty Stanley-Jones updated HBASE-6626:
---------------------------------------
    Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Misty Stanley-Jones updated HBASE-6626:
---------------------------------------
    Attachment: (was: HBASE-6626.patch)
[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Misty Stanley-Jones updated HBASE-6626:
---------------------------------------
    Attachment: HBASE-6626.patch

Re-generated the patch. I can apply it to the current master, so I'm not sure what is wrong.
[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.
[ https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6626:
-------------------------
    Attachment: troubleshooting.txt

Started converting to docbook. Nicolas, you are missing the [1] link below. What did you intend to point to?