[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run

2012-08-13 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433530#comment-13433530
 ] 

Eli Collins commented on HDFS-3787:
---

I kicked the pre-commit build manually.

 BlockManager#close races with ReplicationMonitor#run
 

 Key: HDFS-3787
 URL: https://issues.apache.org/jira/browse/HDFS-3787
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Andy Isaacson
Assignee: Andy Isaacson
Priority: Minor
 Attachments: hdfs-3787-2.txt, hdfs-3787-2.txt, hdfs-3787.txt


 We saw {{TestDirectoryScanner}} fail during shutdown:
 {code}
 2012-08-09 12:17:19,844 WARN  datanode.DataNode 
 (BPServiceActor.java:run(683)) - Ending block pool service for: Block pool 
 BP-610123021-172.29.121.238-1344539835759 (storage id 
 DS-1581877160-172.29.121.238-43609-1344539837880) service to 
 localhost/127.0.0.1:40012
 ...
 2012-08-09 12:17:19,876 FATAL blockmanagement.BlockManager 
 (BlockManager.java:run(3039)) - ReplicationMonitor thread received Runtime 
 exception. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1141)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1116)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3070)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3032)
   at java.lang.Thread.run(Thread.java:662)
 {code}
 Inspecting the code, it appears that {{BlockManager#close -> 
 BlocksMap#close}} can set {{blocks}} to {{null}} while 
 {{computeDatanodeWork}} is running.
 The fix seems simple -- have {{close}} just set an exit flag, and have 
 {{ReplicationMonitor#run}} call {{BlocksMap#close}}.
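
 A minimal sketch of the exit-flag idea described above, not the attached patch. Here close() only sets a flag and interrupts the monitor, and the monitor thread itself releases the map once its loop exits, so no other thread can null the map while work is in progress. The names ExitFlagSketch, FakeBlocksMap, Manager and shouldRun are hypothetical stand-ins, not HDFS classes.

```java
import java.util.HashMap;
import java.util.Map;

class ExitFlagSketch {
    /** Hypothetical stand-in for BlocksMap: close() drops the backing map. */
    static class FakeBlocksMap {
        private Map<Long, String> blocks = new HashMap<>();

        String get(long id) {
            return blocks.get(id); // throws NullPointerException once closed
        }

        void close() {
            blocks = null;
        }

        boolean isClosed() {
            return blocks == null;
        }
    }

    /** Hypothetical stand-in for BlockManager plus its monitor thread. */
    static class Manager {
        final FakeBlocksMap blocksMap = new FakeBlocksMap();
        private volatile boolean shouldRun = true;
        volatile Throwable failure; // set if the monitor loop ever throws
        private final Thread monitor = new Thread(this::runMonitor, "monitor");

        void start() {
            monitor.start();
        }

        private void runMonitor() {
            try {
                while (shouldRun) {
                    blocksMap.get(42L); // stands in for computeDatanodeWork
                    try {
                        Thread.sleep(1);
                    } catch (InterruptedException e) {
                        // fall through and re-check the exit flag
                    }
                }
            } catch (RuntimeException e) {
                failure = e;
            } finally {
                blocksMap.close(); // only the monitor thread closes the map
            }
        }

        /** Requests shutdown; does not touch blocksMap itself. */
        void close() {
            shouldRun = false;
            monitor.interrupt();
        }

        void awaitMonitor() throws InterruptedException {
            monitor.join();
        }
    }

    /** Runs one start/close cycle; true if the monitor exited cleanly and closed the map. */
    static boolean demo() {
        Manager m = new Manager();
        m.start();
        try {
            Thread.sleep(20);
            m.close();
            m.awaitMonitor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return m.failure == null && m.blocksMap.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("clean shutdown: " + demo());
    }
}
```

 The trade-off in this sketch is that cleanup now depends on the monitor thread actually reaching its finally block. If that thread is stuck, the map is never closed.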

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run

2012-08-09 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432364#comment-13432364
 ] 

Karthik Kambatla commented on HDFS-3787:


Thanks Andy. The patch looks like it should fix the race.

However, I wonder if there would ever be a case where the ReplicationMonitor is 
interrupted and the blocksMap should not be closed. To avoid changing the 
semantics (I am not sure if it really changes), how about the following:
{code}
  public void close() {
    if (replicationThread != null) {
      replicationThread.interrupt();
      try {
        replicationThread.join();
      } catch (InterruptedException ie) {
      } finally {
        if (pendingReplications != null) pendingReplications.stop();
        blocksMap.close();
        datanodeManager.close();
      }
    }
  }
{code}

In addition, could we conservatively call pendingReplications.stop() in 
ReplicationMonitor as well?





[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run

2012-08-09 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432367#comment-13432367
 ] 

Andy Isaacson commented on HDFS-3787:
-

That seems reasonable to me, and it keeps the closing logic in close(), where it 
logically belongs.  However, it means that a hung replicationThread will hang 
close() as well if we do an unbounded join.

How about {{join(3000);}}, followed by a finally block?  If the join times out, 
we assume the thread is hung, and it doesn't matter if we close racily.
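
The bounded-join idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the committed patch: BoundedJoinSketch, stopAndClose and the cleanup callback are hypothetical names, and the cleanup Runnable stands in for the pendingReplications.stop(), blocksMap.close() and datanodeManager.close() calls discussed above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedJoinSketch {
    /**
     * Interrupts the worker, waits at most timeoutMs for it to exit, then
     * always runs cleanup. timeoutMs must be positive, because join(0)
     * waits forever. Returns true if the worker is no longer alive.
     */
    static boolean stopAndClose(Thread worker, long timeoutMs, Runnable cleanup) {
        worker.interrupt();
        try {
            worker.join(timeoutMs);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        } finally {
            cleanup.run(); // runs whether or not the join timed out
        }
        return !worker.isAlive();
    }

    /** Exercises one well-behaved worker and one simulated hung worker. */
    static boolean demo() {
        AtomicInteger cleanups = new AtomicInteger();

        // Well-behaved worker: exits as soon as it is interrupted.
        Thread polite = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // exit promptly on interrupt
            }
        });
        polite.start();
        boolean politeExited = stopAndClose(polite, 3000, cleanups::incrementAndGet);

        // Simulated hung worker: ignores interrupts until released.
        AtomicBoolean release = new AtomicBoolean(false);
        Thread hung = new Thread(() -> {
            while (!release.get()) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException ignored) {
                    // deliberately ignores the interrupt
                }
            }
        });
        hung.setDaemon(true);
        hung.start();
        boolean hungExited = stopAndClose(hung, 100, cleanups::incrementAndGet);
        release.set(true); // let the simulated hung thread finish

        return politeExited && !hungExited && cleanups.get() == 2;
    }

    public static void main(String[] args) {
        System.out.println("behaved as expected: " + demo());
    }
}
```

In the timed-out case the cleanup still races with the worker, which is the accepted trade-off described above: a thread that ignores an interrupt for that long is presumed hung.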





[jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run

2012-08-09 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432372#comment-13432372
 ] 

Karthik Kambatla commented on HDFS-3787:


A bounded timeout makes sense. I am not aware of the usual practice here. Do you 
think it would make sense to use one of the IPC timeouts? I'll let you decide.
