[ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432364#comment-13432364 ]
Karthik Kambatla commented on HDFS-3787:
----------------------------------------

Thanks Andy. The patch looks like it should fix the race.

However, I wonder if there would ever be a case where the ReplicationMonitor is interrupted and the blocksMap should not be closed. To avoid "changing" the semantics (I am not sure if it really changes), how about the following:

{code}
public void close() {
  if (replicationThread != null) {
    replicationThread.interrupt();
    try {
      replicationThread.join();
    } catch (InterruptedException ie) {
    } finally {
      if (pendingReplications != null) pendingReplications.stop();
      blocksMap.close();
      datanodeManager.close();
    }
  }
}
{code}

In addition to this, we can conservatively call pendingReplications.stop() in ReplicationMonitor as well?

> BlockManager#close races with ReplicationMonitor#run
> ----------------------------------------------------
>
>                 Key: HDFS-3787
>                 URL: https://issues.apache.org/jira/browse/HDFS-3787
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>         Attachments: hdfs-3787.txt
>
>
> We saw {{TestDirectoryScanner}} fail during shutdown:
> {code}
> 2012-08-09 12:17:19,844 WARN datanode.DataNode (BPServiceActor.java:run(683)) - Ending block pool service for: Block pool BP-610123021-172.29.121.238-1344539835759 (storage id DS-1581877160-172.29.121.238-43609-1344539837880) service to localhost/127.0.0.1:40012
> ...
> 2012-08-09 12:17:19,876 FATAL blockmanagement.BlockManager (BlockManager.java:run(3039)) - ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1141)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1116)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3070)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3032)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> Inspecting the code, it appears that {{BlockManager#close -> BlocksMap#close}} can set {{blocks}} to {{null}} while {{computeDatanodeWork}} is running.
> The fix seems simple -- have {{close}} just set an exit flag, and have {{ReplicationMonitor#run}} call {{BlocksMap#close}}.
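
For illustration only, here is a minimal, self-contained sketch of the fix described in the quoted issue: close() only signals the monitor thread and waits for it, and the monitor thread closes the map itself once it has left its work loop. None of this is the actual HDFS code or the attached patch. {{MonitorShutdownSketch}}, {{ToyBlocksMap}}, {{shouldRun}} and the other names are hypothetical stand-ins, and the sketch assumes the monitor thread was started before close() is called.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for BlockManager; not the real HDFS class. */
final class MonitorShutdownSketch {

    /** Hypothetical stand-in for BlocksMap: close() drops the backing map. */
    static final class ToyBlocksMap {
        private volatile Map<Long, String> blocks = new ConcurrentHashMap<>();

        // Throws NullPointerException if called after close(), like the reported trace.
        String get(long id) { return blocks.get(id); }

        void close() { blocks = null; }

        boolean isClosed() { return blocks == null; }
    }

    private final ToyBlocksMap blocksMap = new ToyBlocksMap();
    private final AtomicReference<Throwable> failure = new AtomicReference<>();
    private volatile boolean shouldRun = true; // the "exit flag"
    private final Thread monitor = new Thread(this::runMonitor, "toy-replication-monitor");

    private void runMonitor() {
        try {
            while (shouldRun) {
                blocksMap.get(1L); // stands in for computeDatanodeWork()
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    break; // interrupted by close(): leave the loop
                }
            }
        } catch (Throwable t) {
            failure.set(t); // would record the NPE if the race still existed
        } finally {
            // Only the monitor thread closes the map, and only after its last read.
            blocksMap.close();
        }
    }

    void start() { monitor.start(); }

    /** Signals the monitor and waits for it; does not touch the map directly. */
    void close() {
        shouldRun = false;
        monitor.interrupt();
        try {
            monitor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
    }

    boolean isMapClosed() { return blocksMap.isClosed(); }

    Throwable failure() { return failure.get(); }

    public static void main(String[] args) throws InterruptedException {
        MonitorShutdownSketch m = new MonitorShutdownSketch();
        m.start();
        Thread.sleep(20); // let the monitor run a few iterations
        m.close();
        System.out.println("map closed: " + m.isMapClosed()
            + ", monitor failure: " + m.failure());
    }
}
```

The variant proposed in the comment differs in where the cleanup happens: close() would interrupt and join the thread, then close the map in its own finally block. Either way, the map is torn down only after the monitor's last read.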