[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run
[ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433530#comment-13433530 ]

Eli Collins commented on HDFS-3787:
-----------------------------------

I kicked the pre-commit build manually.

BlockManager#close races with ReplicationMonitor#run

Key: HDFS-3787
URL: https://issues.apache.org/jira/browse/HDFS-3787
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Andy Isaacson
Assignee: Andy Isaacson
Priority: Minor
Attachments: hdfs-3787-2.txt, hdfs-3787-2.txt, hdfs-3787.txt

We saw {{TestDirectoryScanner}} fail during shutdown:
{code}
2012-08-09 12:17:19,844 WARN datanode.DataNode (BPServiceActor.java:run(683)) - Ending block pool service for: Block pool BP-610123021-172.29.121.238-1344539835759 (storage id DS-1581877160-172.29.121.238-43609-1344539837880) service to localhost/127.0.0.1:40012
...
2012-08-09 12:17:19,876 FATAL blockmanagement.BlockManager (BlockManager.java:run(3039)) - ReplicationMonitor thread received Runtime exception.
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1141)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1116)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3070)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3032)
    at java.lang.Thread.run(Thread.java:662)
{code}
Inspecting the code, it appears that {{BlockManager#close -> BlocksMap#close}} can set {{blocks}} to {{null}} while {{computeDatanodeWork}} is running. The fix seems simple -- have {{close}} just set an exit flag, and have {{ReplicationMonitor#run}} call {{BlocksMap#close}}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
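
The fix proposed in the issue description (have {{close}} only set an exit flag and let the monitor thread release the map itself) could look roughly like the sketch below. This is an illustration only, not the attached patch: the class {{ExitFlagSketch}}, the {{shouldExit}} flag, the {{awaitClosed}} helper and the plain {{HashMap}} standing in for {{BlocksMap}} are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for BlockManager; none of these names are the real HDFS API. */
final class ExitFlagSketch {
    // Stand-in for the map that BlocksMap#close sets to null.
    // Only the monitor thread reads or writes it after start().
    private Map<Long, String> blocks = new HashMap<>();
    private final AtomicBoolean shouldExit = new AtomicBoolean(false);
    private final Thread monitor = new Thread(this::run, "ReplicationMonitorSketch");

    void start() {
        monitor.start();
    }

    /** close() only requests shutdown; it never touches blocks itself. */
    void close() {
        shouldExit.set(true);
        monitor.interrupt(); // wake the monitor if it is sleeping
    }

    private void run() {
        try {
            while (!shouldExit.get()) {
                blocks.get(1L); // stand-in for computeDatanodeWork()
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    // fall through and re-check the exit flag
                }
            }
        } finally {
            // Stand-in for BlocksMap#close, executed by the only thread that reads blocks.
            blocks = null;
        }
    }

    /** Waits up to millis for the monitor to finish; true if it exited and released blocks. */
    boolean awaitClosed(long millis) {
        try {
            monitor.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !monitor.isAlive() && blocks == null;
    }
}
```

Because the monitor is the only thread that dereferences the map, nulling it from that same thread cannot race with {{computeDatanodeWork}}. The trade-off, raised later in the thread, is that the cleanup logic moves out of {{close}}.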
[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run
[ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432364#comment-13432364 ]

Karthik Kambatla commented on HDFS-3787:
----------------------------------------

Thanks Andy. The patch looks like it should fix the race.

However, I wonder if there would ever be a case where the ReplicationMonitor is interrupted and the blocksMap should not be closed. To avoid changing the semantics (I am not sure if it really changes), how about the following:
{code}
public void close() {
  if (replicationThread != null) {
    replicationThread.interrupt();
    try {
      replicationThread.join();
    } catch (InterruptedException ie) {
    } finally {
      if (pendingReplications != null) pendingReplications.stop();
      blocksMap.close();
      datanodeManager.close();
    }
  }
}
{code}
In addition to this, we can conservatively call pendingReplications.stop() in ReplicationMonitor as well?

BlockManager#close races with ReplicationMonitor#run
Key: HDFS-3787
Attachments: hdfs-3787.txt
[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run
[ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432367#comment-13432367 ]

Andy Isaacson commented on HDFS-3787:
-------------------------------------

That seems reasonable to me, and it keeps the closing logic in close() where it logically belongs. However, it means that a hung replicationThread will hang close() as well if we do an unbounded join. How about {{join(3000);}}, followed by a finally block? If the join times out, assume the thread is hung and it doesn't matter if we close racily.

BlockManager#close races with ReplicationMonitor#run
Key: HDFS-3787
Attachments: hdfs-3787.txt
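
For illustration, the bounded join suggested in the comment above might be sketched as follows. This is not the committed patch: {{BoundedJoinSketch}}, its helper methods and the {{HashMap}} standing in for {{BlocksMap}} are hypothetical, and only the 3000 ms timeout comes from the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for BlockManager; names are illustrative, not the HDFS API. */
final class BoundedJoinSketch {
    private static final long JOIN_TIMEOUT_MS = 3000; // value suggested in the thread

    // Stand-in for the map that BlocksMap#close sets to null.
    private volatile Map<Long, String> blocks = new HashMap<>();
    private final Thread replicationThread = new Thread(this::run, "ReplicationMonitorSketch");

    void start() {
        replicationThread.start();
    }

    private void run() {
        while (!Thread.currentThread().isInterrupted()) {
            blocks.get(1L); // stand-in for computeDatanodeWork()
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                return; // interrupted by close(): stop touching blocks
            }
        }
    }

    void close() {
        replicationThread.interrupt();
        try {
            // Bounded wait: a hung monitor can delay close() by at most JOIN_TIMEOUT_MS.
            replicationThread.join(JOIN_TIMEOUT_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        } finally {
            // Runs whether the join finished, timed out, or was interrupted.
            // After a timeout this is knowingly racy, as accepted in the comment above.
            blocks = null;
        }
    }

    boolean isClosed() {
        return blocks == null;
    }

    boolean monitorAlive() {
        return replicationThread.isAlive();
    }
}
```

In the normal case the monitor exits promptly on interrupt, so the map is released only after the thread has stopped reading it. The racy path is limited to a monitor that has not responded within the timeout.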
[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run
[ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432372#comment-13432372 ]

Karthik Kambatla commented on HDFS-3787:
----------------------------------------

Bounded timeout makes sense. I am not aware of the practice - do you think it would make sense to use one of the IPC timeouts? I'll let you decide.

BlockManager#close races with ReplicationMonitor#run
Key: HDFS-3787
Attachments: hdfs-3787-2.txt, hdfs-3787.txt