[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2014-08-07 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh updated HBASE-6626:
--

       Resolution: Fixed
    Fix Version/s: 2.0.0
                   0.99.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

 Add a chapter on HDFS in the troubleshooting section of the HBase reference 
 guide.
 --

 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Misty Stanley-Jones
Priority: Blocker
 Fix For: 0.99.0, 2.0.0

 Attachments: HBASE-6626.patch, troubleshooting.txt


 I looked mainly at the major failure case, but here is what I have:
 New sub-chapter in the existing chapter "Troubleshooting and Debugging
 HBase": HDFS & HBase
 1) HDFS & HBase
 2) Connection related settings
 2.1) Number of retries
 2.2) Timeouts
 3) Log samples
 1) HDFS & HBase
 HBase uses HDFS to store its HFiles, i.e. the core HBase data files, and its
 Write-Ahead-Logs, i.e. the files that will be used to restore the data after
 a crash.
 In both cases, the reliability of HBase comes from the fact that HDFS writes
 the data to multiple locations. To be efficient, HBase needs the data to be
 available locally, hence it is highly recommended to run an HDFS DataNode on
 the same machines as the HBase RegionServers.
 Detailed information on how HDFS works can be found at [1].
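 As a rough illustration of where this data lives, the sketch below lists the
 contents of HBase's root directory on HDFS using the standard hbase.rootdir
 property (defaulting here to /hbase). The class name is illustrative, and the
 sub-directory layout underneath the root differs between HBase versions, so
 treat this only as a starting point:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileStatus;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.hbase.HBaseConfiguration;

     public class ListHBaseRootDir {
       public static void main(String[] args) throws Exception {
         // Load HBase's configuration (hbase-site.xml layered on the Hadoop *-site.xml files).
         Configuration conf = HBaseConfiguration.create();
         // hbase.rootdir points at the HDFS directory holding the HFiles and the WALs.
         Path rootDir = new Path(conf.get("hbase.rootdir", "/hbase"));
         FileSystem fs = rootDir.getFileSystem(conf);
         for (FileStatus status : fs.listStatus(rootDir)) {
           System.out.println(status.getPath());
         }
       }
     }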
 Important points are:
  - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient
 class. This class appears in HBase logs together with other HDFS
 client-related log messages.
  - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS
 side, some are HDFS-client-side, i.e. must be set in HBase, and some must be
 set in both places (see the sketch after this list).
  - HDFS writes are pipelined from one DataNode to another. When writing,
 there is communication between:
     - HBase and the HDFS NameNode, through the HDFS client classes.
     - HBase and the HDFS DataNodes, through the HDFS client classes.
     - the HDFS DataNodes between themselves: issues in these communications
 show up in the HDFS logs, not in the HBase logs. HDFS writes are always
 local when possible. As a consequence, there should not be many write errors
 in the HBase RegionServer logs: RegionServers write to the local DataNode.
 If this DataNode cannot replicate the blocks, that will appear in its own
 logs, not in the RegionServer logs.
  - DataNodes can be contacted through the ipc.Client interface (once again,
 this class can show up in HBase logs) and through the data transfer
 interface (which usually shows up as the DataNode class in the HBase logs).
 These use different ports (the defaults being 50010 and 50020).
  - To understand exactly what is going on, you must look at the HDFS log
 files as well: the HBase logs represent only the client side.
  - With the default settings, HDFS needs 630 seconds to mark a DataNode as
 dead. Until HDFS definitively decides the node is dead, it will still be
 tried by HBase and by other DataNodes when writing and reading, which adds
 some extra lines to the logs. This monitoring is performed by the NameNode.
  - The HDFS clients (i.e. HBase using the HDFS client code) do not fully
 rely on the NameNode, and can temporarily mark a node as dead themselves if
 they got an error when trying to use it.
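 To make the client-side/server-side distinction concrete, the minimal sketch
 below (illustrative class name, real property names) prints the retry-related
 HDFS client settings as HBase itself would resolve them, i.e. from the
 configuration HBase loads rather than from the NameNode or DataNode
 configuration. The defaults used are the ones discussed in the next section:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;

     public class ShowHdfsClientRetrySettings {
       public static void main(String[] args) {
         // HBase is an HDFS client: the client-side values that apply are the ones
         // visible in HBase's own configuration, not the ones on the HDFS servers.
         Configuration conf = HBaseConfiguration.create();
         System.out.println("ipc.client.connect.max.retries = "
             + conf.getInt("ipc.client.connect.max.retries", 10));
         System.out.println("ipc.client.connect.max.retries.on.timeouts = "
             + conf.getInt("ipc.client.connect.max.retries.on.timeouts", 45));
         System.out.println("dfs.client.block.write.retries = "
             + conf.getInt("dfs.client.block.write.retries", 3));
       }
     }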
 2) Settings for retries and timeouts
 2.1) Retries
 ipc.client.connect.max.retries
 Default: 10
 Number of retries a client will make to establish a server connection. Not
 taken into account if the error is a SocketTimeout; in that case the number
 of retries is 45 (fixed on branch by HADOOP-7932, or in HADOOP-7397). For
 SASL, the number of retries is hard-coded to 15. Can be increased,
 especially if the socket timeouts have been lowered.
 ipc.client.connect.max.retries.on.timeouts
 Default: 45
 If you have HADOOP-7932, the maximum number of retries on a timeout. Its
 counter is separate from ipc.client.connect.max.retries, so if you mix
 socket errors and timeouts you will get up to 55 retries with the default
 values. Could be lowered, once it is available. With HADOOP-7397,
 ipc.client.connect.max.retries is reused, so there would be 10 tries.
 dfs.client.block.write.retries
 Default: 3
 Number of tries for the client when writing a block. After a failure, the
 client reconnects to the NameNode to get a new location, sending the list of
 the DataNodes already tried without success. Could be increased, especially
 if the socket timeouts have been lowered (a sketch follows at the end of
 this sub-section). See HBASE-6490.
 dfs.client.block.write.locateFollowingBlock.retries
 Default: 5
 Number of retries to the NameNode when the client gets a
 NotReplicatedYetException, i.e. the existing blocks of the file are not yet
 replicated to dfs.replication.min. This should not impact HBase, as
 dfs.replication.min defaults to 1.
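 If one of these client-side values does need adjusting, it has to be changed
 in the configuration that the HDFS client inside HBase actually reads
 (typically hbase-site.xml). The sketch below shows the same idea done
 programmatically, purely as an illustration; the class name is made up and
 the value 5 is arbitrary:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.hbase.HBaseConfiguration;

     public class RaiseBlockWriteRetries {
       public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         // Client-side override: it only affects FileSystem instances created from
         // this Configuration. For the region servers themselves, the property
         // would have to go into hbase-site.xml instead.
         conf.setInt("dfs.client.block.write.retries", 5); // default is 3
         FileSystem fs = FileSystem.get(conf);
         System.out.println("Writing to " + fs.getUri() + " with "
             + conf.getInt("dfs.client.block.write.retries", 3)
             + " block write retries");
       }
     }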

[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2014-07-09 Thread Misty Stanley-Jones (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misty Stanley-Jones updated HBASE-6626:
---

Attachment: HBASE-6626.patch

I made an attempt. I did not integrate the info in the comments but I did check 
the initial content and updated the Hadoop parameters and defaults where 
needed. I left a couple of the parameters out because they didn't seem to exist 
anymore or were marked as 'expert' in the HDFS config docs. I would consider 
'expert' parameters for HDFS to be out of scope and possibly dangerous for 
HBase to recommend tweaking. WDYT?

 Add a chapter on HDFS in the troubleshooting section of the HBase reference 
 guide.
 --

 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Misty Stanley-Jones
Priority: Blocker
 Attachments: HBASE-6626.patch, troubleshooting.txt



[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2014-07-09 Thread Misty Stanley-Jones (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misty Stanley-Jones updated HBASE-6626:
---

Status: Patch Available  (was: Open)

 Add a chapter on HDFS in the troubleshooting section of the HBase reference 
 guide.
 --

 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Misty Stanley-Jones
Priority: Blocker
 Attachments: HBASE-6626.patch, troubleshooting.txt



[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2014-07-09 Thread Misty Stanley-Jones (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misty Stanley-Jones updated HBASE-6626:
---

Attachment: (was: HBASE-6626.patch)

 Add a chapter on HDFS in the troubleshooting section of the HBase reference 
 guide.
 --

 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Misty Stanley-Jones
Priority: Blocker
 Attachments: troubleshooting.txt



[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2014-07-09 Thread Misty Stanley-Jones (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misty Stanley-Jones updated HBASE-6626:
---

Attachment: HBASE-6626.patch

Re-generated the patch. I can apply it to the current master so I'm not sure 
what is wrong.

 Add a chapter on HDFS in the troubleshooting section of the HBase reference 
 guide.
 --

 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Misty Stanley-Jones
Priority: Blocker
 Attachments: HBASE-6626.patch, troubleshooting.txt



[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6626:
-

Attachment: troubleshooting.txt

Started converting to docbook.

Nicolas, you are missing the [1] link below.  What did you intend to point to?

 Add a chapter on HDFS in the troubleshooting section of the HBase reference 
 guide.
 --

 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.96.0
Reporter: nkeywal
Priority: Minor
 Attachments: troubleshooting.txt

