[ https://issues.apache.org/jira/browse/HADOOP-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541463 ]
stack commented on HADOOP-2040:
-------------------------------

Thanks for jumping in, Konstantin.

Code-wise the shutdown looks orderly: we shut down hbase, not leaving the shutdown method till all hbase servers have exited. We then call shutdown on the mini DFS, which waits first on the datanodes to go down and then on the namenode. (The logging above aligns; see 'Shutting down the Mini HDFS Cluster' happening after the exit of the HMaster thread.)

But something untoward is going on. My working theory is that an interrupted invocation of a native unix command ('du' by the FileSystem?) is hanging hudson with some frequency (see the 12/Oct/07 09:18 PM comment above). If the FS were quiescent, I'd imagine the hang could be avoided.

This failure has loads of complaints of 'Not able to place enough replicas, still in need of 1'. They start up soon after the test starts. What do you think of that? Maybe the dfs is not quiescent because it's trying to replicate. But I see that MiniDFSCluster sets the replication to the number of data nodes.

> [hbase] TestHStoreFile/TestBloomFilter hang occasionally on hudson AFTER test has finished
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-2040
> URL: https://issues.apache.org/jira/browse/HADOOP-2040
> Project: Hadoop
> Issue Type: Bug
> Components: contrib/hbase
> Reporter: stack
> Priority: Minor
> Attachments: endoftesttd.patch
>
>
> Weird. Last night TestBloomFilter was hung after junit had printed test had
> completed without error. Just now, I noticed a hung TestHStore -- again
> after junit had printed out test had succeeded (Nigel Daley has reported he's
> seen at least two hangs in TestHStoreFile, perhaps in same location).
> Last night and just now I was unable to get a thread dump.
> Here is log from around this evenings hang:
> {code}
> ...
> [junit] 2007-10-12 04:19:28,477 INFO [main]
> org.apache.hadoop.hbase.TestHStoreFile.testOutOfRangeMidkeyHalfMapFile(TestHStoreFile.java:366):
> Last bottom when key > top: zz/zz/1192162768317
> [junit] 2007-10-12 04:19:28,493 WARN [IPC Server handler 0 on 36620]
> org.apache.hadoop.dfs.FSDirectory.unprotectedDelete(FSDirectory.java:400):
> DIR* FSDirectory.unprotectedDelete: failed to remove
> /testOutOfRangeMidkeyHalfMapFile because it does not exist
> [junit] Shutting down the Mini HDFS Cluster
> [junit] Shutting down DataNode 1
> [junit] Shutting down DataNode 0
> [junit] 2007-10-12 04:19:29,316 WARN [EMAIL PROTECTED]
> org.apache.hadoop.dfs.PendingReplicationBlocks$PendingReplicationMonitor.run(PendingReplicationBlocks.java:186):
> PendingReplicationMonitor thread received exception.
> java.lang.InterruptedException: sleep interrupted
> [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 16.274 sec
> [junit] Running org.apache.hadoop.hbase.TestHTable
> [junit] Starting DataNode 0 with dfs.data.dir:
> /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data1,/export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data2
> [junit] Starting DataNode 1 with dfs.data.dir:
> /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data3,/export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data4
> [junit] 2007-10-12 05:21:48,332 INFO [main]
> org.apache.hadoop.hbase.HMaster.<init>(HMaster.java:862): Root region dir:
> /hbase/hregion_-ROOT-,,0
> ...
> {code}
> Notice the hour of elapsed (hung) time in above.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.