[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-25 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283197#comment-13283197
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

I committed to trunk and branch-0.22.
Could you guys commit to 2.0, my old box has space only for two versions. I 
think the trunk patch will not require any changes.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0-alpha, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-0.22.patch, blockDeletePolicy-trunk.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282983#comment-13282983
 ] 

Hadoop QA commented on HDFS-3368:
-

+1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12528559/blockDeletePolicy-trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2516//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2516//console

This message is automatically generated.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0-alpha, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-0.22.patch, blockDeletePolicy-trunk.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-24 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282880#comment-13282880
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

No, all failures are unrelated to the patch.
I looked through the Jenkins logs.

# org.apache.hadoop.hdfs.TestDFSClientRetries.testGetFileChecksum 
This one failes because previous test sets xceiver count in config to 2 and 
never resets it back. So creation of a large file in testGetFileChecksum 
eventually fails, because DNs refuse to add more xceiver threads.
{code}
java.io.IOException: Xceiver count 3 exceeds the limit of concurrent xcievers: 2
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:143)
at java.lang.Thread.run(Thread.java:662)
{code}
# 
org.apache.hadoop.hdfs.TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy1
 
Failes because DFSTestUtil.waitCorruptReplicas() is timing- / delay- sensitive.
It reads some file 50 times and checks if the corruption is detected after each 
read.
That time was enough for the DN to restart, but not enough for NN to detect the 
corruption.
Looking for "NameSystem.addToCorruptReplicasMap:" and it is not in the logs.
By the way testBlockCorruptionRecoveryPolicy2 which corrupts 2 replicas onstead 
of one worked fine.
# 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks.testCorruptBlockRereplicatedAcrossRacks
failes for the same reason. I see fifty "Waiting for 1 corrupt replicas", which 
means 50 read have been done, but no "addToCorruptReplicasMap" indicating that 
corruption was not detected.

I can file jiras for that.

Resubmitted the build in case I missed something.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0-alpha, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-0.22.patch, blockDeletePolicy-trunk.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-22 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281146#comment-13281146
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3368:
--

Patch looks good but the failed tests seem related.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-0.22.patch, blockDeletePolicy-trunk.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280865#comment-13280865
 ] 

Hadoop QA commented on HDFS-3368:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12528559/blockDeletePolicy-trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
  org.apache.hadoop.hdfs.TestDFSClientRetries
  org.apache.hadoop.hdfs.TestDatanodeBlockScanner

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2503//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2503//console

This message is automatically generated.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-0.22.patch, blockDeletePolicy-trunk.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-21 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280426#comment-13280426
 ] 

Aaron T. Myers commented on HDFS-3368:
--

I think the problem is not in Hadoop having _too many_ configuration options 
per se, but rather in having _too many that one needs to change_. The key 
difference being that we just need to give it a good default value. Making it 
undocumented is fine and perhaps even desirable, since you're right - most 
people will never need or want to change it. But, if someone (maybe me or you, 
one day) does find some need to change it, they'll be very happy they're able 
to do so without either recompiling or waiting for the next Hadoop release. I 
see no benefit to having it hard-coded as opposed to an undocumented config 
parameter, and some potential benefit to having it configurable.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-21 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280418#comment-13280418
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

I incorporated Nicholas's comments.
Double checking on the configuration parameter before changing it. 
Aaron, sounds like you would generally prefer configuration parameters to 
constants. But this is not always a good approach. There were many discussions 
in the past about it. People are getting confused if you give them too many 
parameters that they have to understand in order to change. This one is hardly 
to be ever changed by anybody. 
So I am asking whether you feel strong about making it an undocumented config 
parameter.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-20 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279930#comment-13279930
 ] 

Aaron T. Myers commented on HDFS-3368:
--

bq. I was thinking about making it configurable, but decided not to introduce 
yet another parameter. The value is chosen based on experimental data. If you 
feel strongly about making it a configuration key, I can add an undocumented 
parameter.

IMO, we should add a configuration option for it. Even if it's unlikely to 
change, if someone does want to change it they'll thank us that they don't have 
to change the code/recompile to do so.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279390#comment-13279390
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3368:
--

I just found that the enableDebugLogging constant is outdated.  It still refers 
to "FSNamesystem logger" but the actually logger is BlockPlacementPolicy.LOG.  
Could you also update it?  Below is my suggested change.
{code}
 public class BlockPlacementPolicyDefault extends BlockPlacementPolicy {
+  private static final String enableDebugLogging
+  = "For more information, please enable DEBUG log level on "
+  + ((Log4JLogger)LOG).getLogger().getName();
+
   private boolean considerLoad; 
   private boolean preferLocalNode = true;
   private NetworkTopology clusterMap;
   private FSClusterStats stats;
-  static final String enableDebugLogging = "For more information, please 
enable"
-+ " DEBUG level logging on the "
-+ "org.apache.hadoop.hdfs.server.namenode.FSNamesystem logger.";
{code}

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279370#comment-13279370
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3368:
--

If you think that it is not like to change the value of the multiplier, then we 
don't have to make it configurable.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279371#comment-13279371
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3368:
--

Typo: like => likely

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279368#comment-13279368
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

Thanks for the review.
Name change makes sense.
I was thinking about making it configurable, but decided not to introduce yet 
another parameter. The value is chosen based on experimental data. If you feel 
strongly about making it a configuration key, I can add an undocumented 
parameter.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279263#comment-13279263
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3368:
--

If a datanode somehow has a long heartbeat interval, it does indicate that the 
datanode may have some problem.  Also, the patch is quite simple.  So, I am 
fine on changing the default replication policy.  Some comments on the patch:

- How about renaming DFS_TOLERATE_HEARTBEAT_MISSES to 
DFS_TOLERATE_HEARTBEAT_MULTIPLIER?

- Since we are not sure what is a good choice of the multiplier, how about 
making it configurable?

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279247#comment-13279247
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

This is pretty rare but when you hit it it takes a while to figure out what 
went wrong. If not fixed the problem becomes a maintenance issue, that is ops 
will have to remember to add every failed node to the exclude list, which 
sometimes is not obvious and definitely time consuming.

Block allocation does not take into account heartbeats. As you know there are 
other mechanisms there, like DN load. Even if new replica is assigned to a node 
that has recently gone down, this will be detected during data transfer and a 
new location will be assigned.
Don't see how it should correlate with the delete policy.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279039#comment-13279039
 ] 

Suresh Srinivas commented on HDFS-3368:
---

bq. Your question about complexity is similar to one of why do we bother 
introducing all the HA complexity if all NameNodes primary and standby can fail 
at once.

:-) I do see the relationship of this to my comments.

I am not saying this problem should not be solved. Given how rare it is, why 
change the default behavior. If others feel this makes sense, I am okay with 
it. Note that the problem is solved only for some cases, that is excess replica 
deletion and not during block allocation.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-18 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278968#comment-13278968
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

You are right a failure of three random nodes leads to a data loss. We know 
that and cannot do anything about it.

The case is different here. The cluster has 6 replicas and ends up with 0 _as 
the result of the current policy_, which makes it almost *inevitable*. This can 
be avoided by a slight modification in the policy making it smarter about 
potentially flaky nodes.

Your question about complexity is similar to one of why do we bother 
introducing all the HA complexity if all NameNodes primary and standby can fail 
at once.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-11 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273690#comment-13273690
 ] 

Suresh Srinivas commented on HDFS-3368:
---

bq. They are not chosen for new blocks. This is a different scenario.
I am saying if such flaky nodes are around, the problem I described can happen 
as well.

My concern is, is it necessary to add this complexity into default policy?

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-11 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273669#comment-13273669
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

> d01, do2, do3 are chosen for adding new block.

They are not chosen for new blocks. This is a different scenario.
do[1-3] went down long time ago (and all blocks were replicated out to other 
nodes), but were not put into exclude list.
*On cluster restart* do[1-3] are brought up along with dn[1-3]. So for a brief 
period of time the block had 6 replicas. 3 of them need to be deleted. Because 
of the current default policy in place the replicas will be chosen to be 
deleted from dn[1-3], because those have less free space. do[1-3] are flaky and 
die shortly after sending block reports on restart. So 10 minutes later all 6 
replicas will be gone.
Just as I described in my first comment. The bug is in the default policy. I'm 
not defining a new one.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-11 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273617#comment-13273617
 ] 

Suresh Srinivas commented on HDFS-3368:
---

Sorry, I should have added more details to my comment:
In your description of the problem first the failure is one by one - "At 
different times all three nodes malfunctioned and died, causing the replicas to 
migrate to dn1, dn2, dn3." Later the failure is together in a short time 
"Expectedly do1, do2, do3 malfunction again and go down shortly after reporting 
their blocks to NN".

While you change how you choose the replicas to delete, the presence of nodes 
like do1, do2 and do3 means that the following scenario is possible:
* d01, do2, do3 are chosen for adding new block.
* client adds a block to these nodes.
* shortly all do1, do2, do3 go down shortly.
Now the replicas are no longer available.

HDFS multiple replicas assumes the probability of three nodes having same 
replicas going down altogether in a short time is low. Given that not sure if 
this problem is important enough. 

Alternatively, given block placement policy is pluggable, you could write a 
custom implementation and not change the default implementation?



> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-11 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273609#comment-13273609
 ] 

Suresh Srinivas commented on HDFS-3368:
---

bq. Expectedly do1, do2, do3 malfunction again and go down shortly after 
reporting their blocks to NN.
Konstantin, I am not sure how likely do1, do2 and do3 all go down with in 10 
minutes. Is it because these are flaky nodes?

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273032#comment-13273032
 ] 

Hadoop QA commented on HDFS-3368:
-

+1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12526469/blockDeletePolicy-trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2421//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2421//console

This message is automatically generated.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy-0.22.patch, 
> blockDeletePolicy-trunk.patch, blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268904#comment-13268904
 ] 

Todd Lipcon commented on HDFS-3368:
---

bq. Todd, if you want to double "minus heartbeat interval", please get some 
motivation for that. I mean why not triple or 1.5 x.

Sure, either of those seem fine too. Just meant to express that using the 
heartbeat interval itself may not be right, since the actual receipt of the 
heart beats may jitter a bit around the configured value (eg if a DN has to 
make a block report, or otherwise gets held up by a couple seconds).

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: blockDeletePolicy.patch
>
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268615#comment-13268615
 ] 

Todd Lipcon commented on HDFS-3368:
---

bq. first look at the oldest heartbeat time, and second at the free space, when 
all heartbeats are within the heartbeat interval

Seems we need some kind of slop factor on the heartbeat interval, right? eg if 
a DN is within 2x the heartbeat interval then it's considered "probably alive" 
for this policy.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-04 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268591#comment-13268591
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3368:
--

Hi Konstantin,

You proposal sounds good.

BTW, I beg you could manually copy the blocks from do1, do2 and do3.  Is it 
case?  If yes, then it is actually data unavailability but not data loss.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3368) Missing blocks due to bad DataNodes comming up and down.

2012-05-04 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268200#comment-13268200
 ] 

Konstantin Shvachko commented on HDFS-3368:
---

I propose to adjust {{BlockPlacementPolicyDefault.chooseReplicaToDelete()}} to 
first look at the oldest heartbeat time, and second at the free space, when all 
heartbeats are within the heartbeat interval.
With such policy in the scenario above the replicas for deletion are most 
likely to be assigned to do1, do2, do3, but will never be deleted, because the 
old nodes have already died. NN will automatically remove replicas from the 
live ones 10 minutes later or so. 
Also when only one or two DNs malfunction in the similar scenario this will 
reduce unnecessary deletions and replications.
No change in policy will be seen in regular case when all nodes function 
properly.

> Missing blocks due to bad DataNodes comming up and down.
> 
>
> Key: HDFS-3368
> URL: https://issues.apache.org/jira/browse/HDFS-3368
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira