[ https://issues.apache.org/jira/browse/HBASE-21666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778462#comment-16778462 ]

Tak Lon (Stephen) Wu edited comment on HBASE-21666 at 2/26/19 6:54 PM:
-----------------------------------------------------------------------

I have done the investigation below and found that the hanging/slowness is 
related to the test node's network setup and local disk issues. I'd like to 
propose that we fail fast instead of timing out at 780+ seconds when possible.

First of all, test methods in {{TestExportSnapshot}} contain two phases of 
operations: operations in the Mini HBase Cluster and operations in the Mini MR 
Cluster, and we are only snapshotting 50 rows into a test table (the data is 
very small).

So, the timeout issue is related to the following:
 1. the building node has an incorrect network interface setup, such that
      a. HDFS file operations hang, e.g.
{quote}2019-02-25 22:28:36,099 ERROR [ClientFinalizer-shutdown-hook] 
hdfs.DFSClient(949): Failed to close inode 16420
 java.io.EOFException: End of File Exception between local host is: 
"f45c89a57f29.ant.amazon.com/192.168.1.15"; destination host is: 
"localhost":54524; : java.io.EOFException; For more details see: 
[http://wiki.apache.org/hadoop/EOFException]
{quote}
    b. the server (region server or HMaster) cannot be reached, or regions 
cannot be assigned, and the client keeps retrying until the timeout, e.g.
{quote}2019-02-26 09:27:54,754 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=57922] 
client.RpcRetryingCallerImpl(132): Call exception, tries=10, retries=19, 
started=96205 ms ago, cancelled=false, msg=Call to 
f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception: 
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed 
servers list: f45c89a57f29-2.local/10.63.166.57:57926, details=row 
'testtb-testExportFileSystemStateWithSkipTmp' on table 'hbase:meta' at 
region=hbase:meta,,1.1588230740, 
hostname=f45c89a57f29-2.local,57926,1551201763075, seqNum=-1, see 
[https://s.apache.org/timeout], 
exception=org.apache.hadoop.hbase.ipc.FailedServerException: Call to 
f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception: 
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed 
servers list: f45c89a57f29-2.local/10.63.166.57:57926
{quote}
2. the building node has an out-of-disk-space issue such that the NodeManager 
is not in a healthy state, e.g. I saw {{1/1 local-dirs are bad: /yarn/nm; 1/1 
log-dirs are bad: /yarn/container-logs}} in the NodeManager UI, even though we 
have set 
{{yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage}}
 to 99%

Of the above cases, case 1) is a node setup issue (e.g. in {{/etc/hosts}}) 
that can be fixed by the infra admin or by the contributor running the unit 
test on their laptop/machine, so we don't need to fix it in code.

For case 2), I'm thinking of setting 
{{yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb}} to 128 MB 
(which should be enough for log-dirs and local-dirs) so that 
{{[TestExportSnapshot#setUpBeforeClass|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/snapshot/TestExportSnapshot.java#L100-L104]}}
 fails fast when starting the mini MR cluster instead of timing out after 780+ 
seconds.
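
As a side note, the reason for using {{Configuration#setIfUnset}} is that the 
default is applied only when the property has not already been configured, so 
a test that needs a different threshold can still override it before the mini 
MR cluster starts. A minimal, self-contained sketch of that semantics (the 
class name is hypothetical, and {{java.util.Properties}} stands in for 
Hadoop's {{Configuration}} here so the sketch runs without Hadoop on the 
classpath):
{code:java}
import java.util.Properties;

public class DiskHealthConfSketch {

  // Stand-in for Hadoop's Configuration#setIfUnset: apply the default
  // only when the caller has not already configured the property.
  static void setIfUnset(Properties conf, String key, String value) {
    if (conf.getProperty(key) == null) {
      conf.setProperty(key, value);
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Suppose a test already lowered the utilization threshold...
    conf.setProperty(
        "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage",
        "90.0");
    // ...then our defaults only fill in the properties that are still unset.
    setIfUnset(conf,
        "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage",
        "99.0");
    setIfUnset(conf,
        "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb",
        "128");
    // The pre-existing value wins; the free-space default is applied.
    System.out.println(conf.getProperty(
        "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"));
    System.out.println(conf.getProperty(
        "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb"));
  }
}
{code}
Running the sketch prints {{90.0}} and {{128}}: the already-set utilization 
threshold is untouched, while the new free-space minimum takes effect.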

In fact, if the building node does not have any of these connection or disk 
issues, the average time to run all (7) tests within {{TestExportSnapshot}} is 
about 280 seconds, and IMO splitting some of the test methods into separate 
classes won't speed it up, since the tests within each class are executed 
sequentially. (Are we running tests in parallel, especially for 
{{TestExportSnapshot}}, which is labeled as {{LargeTests}}? When I tested with 
{{mvn test -PrunAllTests -Dtest=TestExportSnapshot}}, I didn't see methods 
running concurrently, even though I found {{surefire.secondPartForkCount=5}} 
for {{runAllTests}}. If anyone confirms that they do run in parallel, we can 
also separate each method in {{TestExportSnapshot}} into its own class.)

So, if we think a disk space issue on YARN's NodeManager should fail fast when 
running tests, the proposed code change in 
{{HBaseTestingUtility#startMiniMapReduceCluster}} would be as below.

Any comments?
{code:java}
@@ -2736,6 +2736,8 @@ public class HBaseTestingUtility extends HBaseZKTestingUtility {
     conf.setIfUnset(
         "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage",
         "99.0");
+    // Make sure we have enough disk space for log-dirs and local-dirs
+    conf.setIfUnset("yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb", "128");
     startMiniMapReduceCluster(2);
     return mrCluster;
   }
{code}



> Break up the TestExportSnapshot UTs; they can timeout
> -----------------------------------------------------
>
>                 Key: HBASE-21666
>                 URL: https://issues.apache.org/jira/browse/HBASE-21666
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: stack
>            Assignee: Tak Lon (Stephen) Wu
>            Priority: Major
>              Labels: beginner
>
> These timed out for [~Apache9] when he ran with the -PrunAllTests. Suggests 
> breaking them up into smaller tests so less likely they'll timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
