[ https://issues.apache.org/jira/browse/HBASE-21666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778462#comment-16778462 ]
Tak Lon (Stephen) Wu commented on HBASE-21666:
----------------------------------------------

I have investigated this, and I found that the hang/slowness is related to the test node's network setup and local disk issues. I'd like to propose that we fail fast where possible instead of timing out at 780+ seconds.

First of all, the test methods in {{TestExportSnapshot}} contain two phases of operations: operations in the Mini HBase Cluster and operations in the Mini MR Cluster. We are only snapshotting 50 rows into a test table, so the data is very small.

So, the timeout issue is related to the following:

1. The build node has an `incorrect` network interface setup, such that

a. it hangs the HDFS file operations, e.g.
{quote}2019-02-25 22:28:36,099 ERROR [ClientFinalizer-shutdown-hook] hdfs.DFSClient(949): Failed to close inode 16420 java.io.EOFException: End of File Exception between local host is: "f45c89a57f29.ant.amazon.com/192.168.1.15"; destination host is: "localhost":54524; : java.io.EOFException; For more details see: [http://wiki.apache.org/hadoop/EOFException] {quote}
b. a server (region server or HMaster) cannot be connected to, or regions cannot be assigned, and the client keeps retrying until the timeout, e.g.
{quote}2019-02-26 09:27:54,754 DEBUG [RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=57922] client.RpcRetryingCallerImpl(132): Call exception, tries=10, retries=19, started=96205 ms ago, cancelled=false, msg=Call to f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: f45c89a57f29-2.local/10.63.166.57:57926, details=row 'testtb-testExportFileSystemStateWithSkipTmp' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=f45c89a57f29-2.local,57926,1551201763075, seqNum=-1, see [https://s.apache.org/timeout], exception=org.apache.hadoop.hbase.ipc.FailedServerException: Call to f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: f45c89a57f29-2.local/10.63.166.57:57926 {quote}

2. The build node has an out-of-disk-space issue, such that the node manager is not in a healthy state. For example, I saw {{1/1 local-dirs are bad: /yarn/nm; 1/1 log-dirs are bad: /yarn/container-logs}} in the node manager UI, even though we have set {{yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage}} to 99%.

In the above cases, assuming case 1) is a node setup issue (e.g. in {{/etc/hosts}}) that can be fixed by the infra admin or by the contributor who is running the unit tests on their laptop/machine, we don't need to fix it.
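A minimal sketch of the kind of check behind case 2), assuming the NodeManager disk health checker marks a local-dir or log-dir as bad when either its utilization exceeds the configured percentage or its free space falls below the configured minimum. This is not Hadoop code: {{DiskHealthSketch}} and {{isHealthy}} are hypothetical names used only for illustration, and the thresholds in {{main}} (99.0 and 128) are simply the values discussed in this comment.

```java
import java.io.File;

/** Hypothetical helper, not part of Hadoop or HBase: illustrates two disk thresholds. */
final class DiskHealthSketch {
  private static final long BYTES_PER_MB = 1024L * 1024L;

  /**
   * Returns true only if the directory's file system is at or below the
   * utilization cap AND still has at least minFreeMb of usable space.
   */
  static boolean isHealthy(long totalBytes, long usableBytes,
      double maxUtilizationPercent, long minFreeMb) {
    if (totalBytes <= 0) {
      return false; // size unknown or path unreadable: treat as unhealthy
    }
    double usedPercent = 100.0 * (totalBytes - usableBytes) / totalBytes;
    long freeMb = usableBytes / BYTES_PER_MB;
    return usedPercent <= maxUtilizationPercent && freeMb >= minFreeMb;
  }

  public static void main(String[] args) {
    // Check the directory given on the command line, or the current directory.
    File dir = new File(args.length > 0 ? args[0] : ".");
    boolean ok = isHealthy(dir.getTotalSpace(), dir.getUsableSpace(), 99.0, 128L);
    System.out.println(dir.getAbsolutePath() + " healthy=" + ok);
  }
}
```

Under these assumptions, a disk that is 90% used but has only 100MB free passes a 99% utilization cap on its own and is rejected only once a minimum of 128MB free space is also required.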
For case 2), I'm thinking of setting a new value, {{yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb}}, to 128MB (which should be enough for log-dirs and local-dirs), so that we fail fast when starting the miniMRCluster in {{[TestExportSnapshot#setUpBeforeClass|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/snapshot/TestExportSnapshot.java#L100-L104]}} instead of timing out after 780+ seconds.

In fact, if the build node does not have any of the connection and disk issues, the average time to run all tests within {{TestExportSnapshot}} is about 280 seconds. IMO splitting some of the test methods into separate classes won't speed this up if the tests of each class are still executed in sequential order. (Are we running tests in parallel, especially for {{TestExportSnapshot}}, which is labeled as {{LargeTests}}? When I tested with {{mvn test -PrunAllTests -Dtest=TestExportSnapshot}}, I didn't see methods running concurrently, even though I found {{surefire.secondPartForkCount=5}} for {{runAllTests}}.)

So, if we think a disk space issue in YARN's node manager should fail fast when running tests, the proposed code change in {{HBaseTestingUtility#startMiniMapReduceCluster}} is as below. Any comments?
{code:java}
@@ -2736,6 +2736,8 @@ public class HBaseTestingUtility extends HBaseZKTestingUtility {
     conf.setIfUnset(
         "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage",
         "99.0");
+    // Make sure we have enough disk space for log-dirs and local-dirs
+    conf.setIfUnset("yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb", "128");
     startMiniMapReduceCluster(2);
     return mrCluster;
   }
{code}

> Break up the TestExportSnapshot UTs; they can timeout
> -----------------------------------------------------
>
> Key: HBASE-21666
> URL: https://issues.apache.org/jira/browse/HBASE-21666
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: stack
> Assignee: Tak Lon (Stephen) Wu
> Priority: Major
> Labels: beginner
>
> These timed out for [~Apache9] when he ran with the -PrunAllTests. Suggests
> breaking them up into smaller tests so less likely they'll timeout.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)