[ 
https://issues.apache.org/jira/browse/HADOOP-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541463
 ] 

stack commented on HADOOP-2040:
-------------------------------

Thanks for jumping in, Konstantin.

Code-wise the shutdown looks orderly: we shut down hbase, not leaving the 
shutdown method until all hbase servers have exited.  We then call shutdown 
on the mini DFS, which waits first on the datanodes to go down and then on 
the namenode (the logging above agrees; see 'Shutting down the Mini HDFS 
Cluster' happening after the exit of the HMaster thread).

But something untoward is going on.  My working theory is that an interrupted 
invocation of a native unix command ('du' by the FileSystem?) is hanging hudson 
with some frequency (See 12/Oct/07 09:18 PM comment above).  If the FS were 
quiescent, I'd imagine the hang could be avoided.
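
To illustrate the theory above, here is a minimal sketch of bounding an 
external command so that a stuck or interrupted child process cannot hang the 
caller.  This is not what the FileSystem code does; BoundedCommand and run 
are hypothetical names, and the sketch assumes a modern JDK (Java 9+ for 
Redirect.DISCARD), which was not available when this issue was filed.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hadoop: runs a command with an upper
// bound on how long the caller will wait for it.
final class BoundedCommand {
    /**
     * Runs the command and waits at most timeoutMillis for it to exit.
     * Returns the exit code, or -1 if the wait timed out or was interrupted.
     */
    static int run(List<String> command, long timeoutMillis) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Discard output so the child cannot block on a full pipe buffer.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process p = pb.start();
        try {
            if (p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return p.exitValue();
            }
            return -1; // timed out
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller instead of swallowing it.
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            // Make sure the child is gone; has no effect if it already exited.
            p.destroyForcibly();
        }
    }
}
```

With something along these lines, a call such as 
BoundedCommand.run(Arrays.asList("du", "-sk", dir), 10000) would give up 
after ten seconds instead of waiting indefinitely (dir is a placeholder).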

This failure has loads of complaints of 'Not able to place enough replicas, 
still in need of 1'.  They start up soon after the test starts.  What do you 
think of that?  Maybe the dfs is not quiescent because it's trying to 
replicate.  But I see that MiniDFSCluster sets the replication to the number 
of data nodes.

> [hbase] TestHStoreFile/TestBloomFilter hang occasionally on hudson AFTER test 
> has finished
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2040
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2040
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Priority: Minor
>         Attachments: endoftesttd.patch
>
>
> Weird.  Last night TestBloomFilter was hung after junit had printed that the 
> test had completed without error.  Just now, I noticed a hung TestHStore -- 
> again after junit had printed out that the test had succeeded (Nigel Daley 
> has reported he's seen at least two hangs in TestHStoreFile, perhaps in the 
> same location).
> Last night and just now I was unable to get a thread dump.
> Here is the log from around this evening's hang:
> {code}
> ...
>     [junit] 2007-10-12 04:19:28,477 INFO  [main] 
> org.apache.hadoop.hbase.TestHStoreFile.testOutOfRangeMidkeyHalfMapFile(TestHStoreFile.java:366):
>  Last bottom when key > top: zz/zz/1192162768317
>     [junit] 2007-10-12 04:19:28,493 WARN  [IPC Server handler 0 on 36620] 
> org.apache.hadoop.dfs.FSDirectory.unprotectedDelete(FSDirectory.java:400): 
> DIR* FSDirectory.unprotectedDelete: failed to remove 
> /testOutOfRangeMidkeyHalfMapFile because it does not exist
>     [junit] Shutting down the Mini HDFS Cluster
>     [junit] Shutting down DataNode 1
>     [junit] Shutting down DataNode 0
>     [junit] 2007-10-12 04:19:29,316 WARN  [EMAIL PROTECTED] 
> org.apache.hadoop.dfs.PendingReplicationBlocks$PendingReplicationMonitor.run(PendingReplicationBlocks.java:186):
>  PendingReplicationMonitor thread received exception. 
> java.lang.InterruptedException: sleep interrupted
>     [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 16.274 sec
>     [junit] Running org.apache.hadoop.hbase.TestHTable
>     [junit] Starting DataNode 0 with dfs.data.dir: 
> /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data1,/export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data2
>     [junit] Starting DataNode 1 with dfs.data.dir: 
> /export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data3,/export/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/hbase/test/data/dfs/data/data4
>     [junit] 2007-10-12 05:21:48,332 INFO  [main] 
> org.apache.hadoop.hbase.HMaster.<init>(HMaster.java:862): Root region dir: 
> /hbase/hregion_-ROOT-,,0
> ...
> {code}
> Notice the hour of elapsed (hung) time in above.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
