[jira] Updated: (HDFS-1276) Put the failed volumes in the report of HDFS status

2010-06-29 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated HDFS-1276:
-

Attachment: HDFS_1276.patch

Attaching the patch. The unit test depends on HDFS-1273, so I will attach the 
unit test after HDFS-1273 is committed.

> Put the failed volumes in the report of HDFS status
> ---
>
> Key: HDFS-1276
> URL: https://issues.apache.org/jira/browse/HDFS-1276
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.21.0
>Reporter: Jeff Zhang
> Fix For: 0.21.0
>
> Attachments: HDFS_1276.patch
>
>
> Currently, users do not know which volumes have failed unless they look into 
> the logs, which is not convenient. I plan to put the failed volumes in the 
> HDFS status report, so that HDFS administrators can use the command 
> "bin/hadoop dfsadmin -report" to find which volumes have failed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-29 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883785#action_12883785
 ] 

sam rash commented on HDFS-1057:


From the raw console output of Hudson:

 [exec] [junit] Tests run: 3, Failures: 0, Errors: 1, Time elapsed: 
0.624 sec
 [exec] [junit] Test 
org.apache.hadoop.hdfs.security.token.block.TestBlockToken FAILED
--
 [exec] [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 
0.706 sec
 [exec] [junit] Test org.apache.hadoop.hdfs.server.common.TestJspHelper 
FAILED
--
 [exec] [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 
28.477 sec
 [exec] [junit] Test org.apache.hadoop.hdfsproxy.TestHdfsProxy FAILED

I ran the tests locally and the first two succeed. The third fails on the latest 
trunk even without hdfs-1057, so from the test perspective I think this is safe 
to commit.

1. TestBlockToken

run-test-hdfs:
   [delete] Deleting directory 
/data/users/srash/apache/hadoop-hdfs/build/test/data
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/data
   [delete] Deleting directory 
/data/users/srash/apache/hadoop-hdfs/build/test/logs
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/logs
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/usr/local/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/home/srash/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.hadoop.hdfs.security.token.block.TestBlockToken
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.248 sec


2. TestJspHelper
run-test-hdfs:
   [delete] Deleting directory 
/data/users/srash/apache/hadoop-hdfs/build/test/data
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/data
   [delete] Deleting directory 
/data/users/srash/apache/hadoop-hdfs/build/test/logs
[mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/logs
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/usr/local/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/home/srash/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.hadoop.hdfs.server.common.TestJspHelper
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.275 sec


> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> ---
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node
>Affects Versions: 0.20-append, 0.21.0, 0.22.0
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Blocker
> Fix For: 0.20-append
>
> Attachments: conurrent-reader-patch-1.txt, 
> conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
> HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
> hdfs-1057-trunk-6.txt
>
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.
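
To make the ordering concrete, here is a minimal, self-contained Java sketch of 
the race described above. The class and field names are stand-ins, not the actual 
BlockReceiver code; only the "setBytesOnDisk before flush" ordering is taken from 
the description.

    // Toy model of the race: the visible length must only advance after the
    // buffered bytes have actually reached the file.
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;

    class VisibleLengthSketch {
        private final BufferedOutputStream out;
        private final AtomicLong bytesOnDisk = new AtomicLong(0); // length a reader may see

        VisibleLengthSketch(String path) throws IOException {
            this.out = new BufferedOutputStream(new FileOutputStream(path));
        }

        // Buggy order (as described above): the length is published before the
        // flush, so a concurrent reader can read past the bytes actually on disk.
        void receivePacketBuggy(byte[] packet) throws IOException {
            bytesOnDisk.addAndGet(packet.length);
            out.write(packet);
            out.flush();
        }

        // Safer order: write and flush first, then publish the new length.
        void receivePacketFixed(byte[] packet) throws IOException {
            out.write(packet);
            out.flush();
            bytesOnDisk.addAndGet(packet.length);
        }

        long getVisibleLength() {
            return bytesOnDisk.get();
        }
    }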

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1276) Put the failed volumes in the report of HDFS status

2010-06-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883764#action_12883764
 ] 

Jeff Zhang commented on HDFS-1276:
--

The new report output should look like this (the red line is the failed-volume 
information):

Configured Capacity: 118633709568 (110.49 GB)
Present Capacity: 85291678356 (79.43 GB)
DFS Remaining: 85291560960 (79.43 GB)
DFS Used: 117396 (114.64 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-
Datanodes available: 2 (2 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:43452 (localhost)
Decommission Status : Normal
{color:red}Failed Volumes: 
/home/zjffdu/workspace/HDFS_trunk/build/test/data/dfs/data/data3{color}
Configured Capacity: 39544569856 (36.83 GB)
DFS Used: 44362 (43.32 KB)
Non DFS Used: 11114005174 (10.35 GB)
DFS Remaining: 28430520320 (26.48 GB)
DFS Used%: 0%
DFS Remaining%: 71.89%
Last contact: Wed Jun 30 10:06:17 CST 2010


Name: 127.0.0.1:52494 (localhost)
Decommission Status : Normal
Configured Capacity: 79089139712 (73.66 GB)
DFS Used: 73034 (71.32 KB)
Non DFS Used: 22228026038 (20.7 GB)
DFS Remaining: 56861040640 (52.96 GB)
DFS Used%: 0%
DFS Remaining%: 71.89%
Last contact: Wed Jun 30 10:06:19 CST 2010

> Put the failed volumes in the report of HDFS status
> ---
>
> Key: HDFS-1276
> URL: https://issues.apache.org/jira/browse/HDFS-1276
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.21.0
>Reporter: Jeff Zhang
> Fix For: 0.21.0
>
>
> Currently, users do not know which volumes have failed unless they look into 
> the logs, which is not convenient. I plan to put the failed volumes in the 
> HDFS status report, so that HDFS administrators can use the command 
> "bin/hadoop dfsadmin -report" to find which volumes have failed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1276) Put the failed volumes in the report of HDFS status

2010-06-29 Thread Jeff Zhang (JIRA)
Put the failed volumes in the report of HDFS status
---

 Key: HDFS-1276
 URL: https://issues.apache.org/jira/browse/HDFS-1276
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.21.0
Reporter: Jeff Zhang
 Fix For: 0.21.0


Currently, users do not know which volumes have failed unless they look into the 
logs, which is not convenient. I plan to put the failed volumes in the HDFS 
status report, so that HDFS administrators can use the command "bin/hadoop dfsadmin 
-report" to find which volumes have failed.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1275) Enabling Kerberized SSL on NameNode

2010-06-29 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang resolved HDFS-1275.
-

Resolution: Duplicate

Closing this one as it is a duplicate of HDFS-1004.

> Enabling Kerberized SSL on NameNode
> ---
>
> Key: HDFS-1275
> URL: https://issues.apache.org/jira/browse/HDFS-1275
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kan Zhang
> Attachments: h6584-03.patch
>
>
> This is the HDFS part of HADOOP-6584.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1004) Update NN to support Kerberized SSL from HADOOP-6584

2010-06-29 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated HDFS-1004:


Attachment: h6584-03.patch

Uploading a new patch that is simply a port of Jakob's patch for HADOOP-6584.

> Update NN to support Kerberized SSL from HADOOP-6584
> 
>
> Key: HDFS-1004
> URL: https://issues.apache.org/jira/browse/HDFS-1004
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Reporter: Jakob Homan
>Assignee: Jakob Homan
> Attachments: h6584-03.patch, HDFS-1004.patch
>
>
> Namenode needs to be tweaked to use the new Kerberos-backed SSL connector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1275) Enabling Kerberized SSL on NameNode

2010-06-29 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated HDFS-1275:


Attachment: h6584-03.patch

Uploading a patch that is simply a port of Jakob's patch for HADOOP-6584.

> Enabling Kerberized SSL on NameNode
> ---
>
> Key: HDFS-1275
> URL: https://issues.apache.org/jira/browse/HDFS-1275
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kan Zhang
> Attachments: h6584-03.patch
>
>
> This is the HDFS part of HADOOP-6584.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1004) Update NN to support Kerberized SSL from HADOOP-6584

2010-06-29 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883745#action_12883745
 ] 

Kan Zhang commented on HDFS-1004:
-

Sorry, I created HDFS-1275 before noticing this one. Will close HDFS-1275 as 
duplicate.

> Update NN to support Kerberized SSL from HADOOP-6584
> 
>
> Key: HDFS-1004
> URL: https://issues.apache.org/jira/browse/HDFS-1004
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Reporter: Jakob Homan
>Assignee: Jakob Homan
> Attachments: HDFS-1004.patch
>
>
> Namenode needs to be tweaked to use the new Kerberos-backed SSL connector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1275) Enabling Kerberized SSL on NameNode

2010-06-29 Thread Kan Zhang (JIRA)
Enabling Kerberized SSL on NameNode
---

 Key: HDFS-1275
 URL: https://issues.apache.org/jira/browse/HDFS-1275
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Kan Zhang


This is the HDFS part of HADOOP-6584.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1271) Decommissioning nodes not persisted between NameNode restarts

2010-06-29 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883700#action_12883700
 ] 

Allen Wittenauer commented on HDFS-1271:


+1.  I've just been too lazy to file a bug. :) 

> Decommissioning nodes not persisted between NameNode restarts
> -
>
> Key: HDFS-1271
> URL: https://issues.apache.org/jira/browse/HDFS-1271
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Travis Crawford
>
> Datanodes in the process of being decommissioned should still be 
> decommissioning after namenode restarts. Currently they are marked as dead 
> after a restart.
> Details:
> Nodes can be safely removed from a cluster by marking them as decommissioned 
> and waiting for their data to be replicated elsewhere. This is accomplished 
> by adding a node to the file referenced by dfs.hosts.excluded, then 
> refreshing nodes.
> Decommissioning means block reports from the decommissioned datanode are no 
> longer accepted by the namenode, meaning for decommissioning to occur the NN 
> must have an existing block report. That is, a datanode can transition from: 
> live --> decommissioning --> dead. Nodes can NOT transition from: dead --> 
> decommissioning --> dead.
> Operationally this is problematic because intervention is required should the 
> NN restart while nodes are decommissioning, meaning in-house administration 
> tools must be more complex, or more likely admins have to babysit the 
> decommissioning process.
> Someone more familiar with the code might have a better idea, but perhaps the 
> first block report for dfs.hosts.excluded hosts should be accepted?
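
For reference, the workflow described above is roughly the following two steps 
(the host name and exclude-file path are placeholders; the exclude file is 
whatever path the namenode's exclude setting points to):

    $ echo "datanode1.example.com" >> /path/to/exclude-file
    $ bin/hadoop dfsadmin -refreshNodes

After the refresh, the namenode marks the listed hosts as decommissioning and 
begins re-replicating their blocks; the problem reported here is that this state 
is lost if the namenode restarts before decommissioning completes.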

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1258) Clearing namespace quota on "/" corrupts FS image

2010-06-29 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883699#action_12883699
 ] 

Aaron T. Myers commented on HDFS-1258:
--

Both of those tests were failing in trunk before I created the patch.

> Clearing namespace quota on "/" corrupts FS image
> -
>
> Key: HDFS-1258
> URL: https://issues.apache.org/jira/browse/HDFS-1258
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Blocker
> Fix For: 0.20.3, 0.21.0, 0.22.0
>
> Attachments: clear-quota.patch, clear-quota.patch
>
>
> The HDFS root directory starts out with a default namespace quota of 
> Integer.MAX_VALUE. If you clear this quota (using "hadoop dfsadmin -clrQuota 
> /"), the fsimage gets corrupted immediately. Subsequent 2NN rolls will fail, 
> and the NN will not come back up from a restart.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-06-29 Thread Rodrigo Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883686#action_12883686
 ] 

Rodrigo Schmidt commented on HDFS-1111:
---

I thought DistributedFileSystem and Hdfs classes contact the namenode via RPC, 
using ClientProtocol. Maybe I'm missing something but I think that even if we 
change Hdfs and DistributedFileSystem, getCorruptFiles() will have to be part 
of ClientProtocol.

> getCorruptFiles() should give some hint that the list is not complete
> -
>
> Key: HDFS-1111
> URL: https://issues.apache.org/jira/browse/HDFS-1111
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Rodrigo Schmidt
>Assignee: Rodrigo Schmidt
> Attachments: HADFS-1111.0.patch
>
>
> The list of corrupt files returned by the namenode doesn't indicate whether 
> the number of corrupted files is larger than the call output limit (which 
> would mean the list is not complete). There should be a way to hint at 
> incompleteness to clients.
> A simple hack would be to add an extra entry to the array returned with the 
> value null. Clients could interpret this as a sign that there are other 
> corrupt files in the system.
> We should also do some rephrasing of the fsck output to make it more 
> confident when the list is complete and less confident when the list is 
> known to be incomplete.
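
A minimal sketch of how a client might consume the proposed null-sentinel 
convention. The String[] return shape and the helper below are assumptions for 
illustration; they are not the actual getCorruptFiles() API.

    final class CorruptFilesSummary {
        // Interprets a trailing null entry as "more corrupt files exist than the
        // call output limit allowed" (the convention proposed above).
        static String summarize(String[] corrupt) {
            boolean truncated =
                corrupt.length > 0 && corrupt[corrupt.length - 1] == null;
            int reported = truncated ? corrupt.length - 1 : corrupt.length;
            return reported + " corrupt file(s)" + (truncated ? " (list incomplete)" : "");
        }
    }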

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883676#action_12883676
 ] 

Hadoop QA commented on HDFS-1057:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448323/hdfs-1057-trunk-6.txt
  against trunk revision 957669.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/console

This message is automatically generated.

> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> ---
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node
>Affects Versions: 0.20-append, 0.21.0, 0.22.0
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Blocker
> Fix For: 0.20-append
>
> Attachments: conurrent-reader-patch-1.txt, 
> conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
> HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
> hdfs-1057-trunk-6.txt
>
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1271) Decommissioning nodes not persisted between NameNode restarts

2010-06-29 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883655#action_12883655
 ] 

dhruba borthakur commented on HDFS-1271:


This is a *real* problem that we have faced on our clusters too.

> Decommissioning nodes not persisted between NameNode restarts
> -
>
> Key: HDFS-1271
> URL: https://issues.apache.org/jira/browse/HDFS-1271
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Travis Crawford
>
> Datanodes in the process of being decommissioned should still be 
> decommissioning after namenode restarts. Currently they are marked as dead 
> after a restart.
> Details:
> Nodes can be safely removed from a cluster by marking them as decommissioned 
> and waiting for their data to be replicated elsewhere. This is accomplished 
> by adding a node to the file referenced by dfs.hosts.excluded, then 
> refreshing nodes.
> Decommissioning means block reports from the decommissioned datanode are no 
> longer accepted by the namenode, meaning for decommissioning to occur the NN 
> must have an existing block report. That is, a datanode can transition from: 
> live --> decommissioning --> dead. Nodes can NOT transition from: dead --> 
> decommissioning --> dead.
> Operationally this is problematic because intervention is required should the 
> NN restart while nodes are decommissioning, meaning in-house administration 
> tools must be more complex, or more likely admins have to babysit the 
> decommissioning process.
> Someone more familiar with the code might have a better idea, but perhaps the 
> first block report for dfs.hosts.excluded hosts should be accepted?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-233) Support for snapshots

2010-06-29 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883633#action_12883633
 ] 

dhruba borthakur commented on HDFS-233:
---

There hasn't been much work in this direction. It was deemed complex to 
implement because it needs lots of changes to the current NameNode/DataNode 
code. I have a proposal in mind that can implement "HDFS snapshots" as a layer 
on top of the current HDFS code with negligible changes to the existing 
NameNode/DataNode architecture. 

If you have any ideas regarding this, or are willing to contribute towards this 
effort, that would be great! Thanks.

> Support for snapshots
> -
>
> Key: HDFS-233
> URL: https://issues.apache.org/jira/browse/HDFS-233
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: Snapshots.pdf, Snapshots.pdf
>
>
> Support HDFS snapshots. It should support creating snapshots without shutting 
> down the file system. Snapshot creation should be lightweight and a typical 
> system should be able to support a few thousands concurrent snapshots. There 
> should be a way to surface (i.e. mount) a few of these snapshots 
> simultaneously.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1094) Intelligent block placement policy to decrease probability of block loss

2010-06-29 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883632#action_12883632
 ] 

dhruba borthakur commented on HDFS-1094:


@Koji: we have files with a replication factor of 3. If a large number of 
datanodes fail at the same time, we do see missing blocks. Sometimes the 
datanode process on these machines fails to start even after repeated 
start-dfs.sh attempts; sometimes the entire machine fails to reboot. Then we 
have to manually fix a few of those bad datanode machines and bring them back 
online; this fixes the "missing blocks" problem, but it is a manual process and 
is painful.



> Intelligent block placement policy to decrease probability of block loss
> 
>
> Key: HDFS-1094
> URL: https://issues.apache.org/jira/browse/HDFS-1094
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: Rodrigo Schmidt
> Attachments: prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and 
> the other two replicas are on any two random nodes on a random remote rack. 
> This means that if any three datanodes die together, then there is a 
> non-trivial probability of losing at least one block in the cluster. This 
> JIRA is to discuss if there is a better algorithm that can lower probability 
> of losing a block.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-617) Support for non-recursive create() in HDFS

2010-06-29 Thread Nicolas Spiegelberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Spiegelberg updated HDFS-617:
-

Affects Version/s: 0.20-append

> Support for non-recursive create() in HDFS
> --
>
> Key: HDFS-617
> URL: https://issues.apache.org/jira/browse/HDFS-617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Kan Zhang
>Assignee: Kan Zhang
> Fix For: 0.21.0
>
> Attachments: h617-01.patch, h617-02.patch, h617-03.patch, 
> h617-04.patch, h617-06.patch, HDFS-617_20-append.patch
>
>
> HADOOP-4952 calls for a create call that doesn't automatically create missing 
> parent directories.
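
For context, the requested behavior is sketched below with plain FileSystem calls. 
This is only an approximation of the semantics; the createNonRecursive API added 
by the attached patches may look different, and the real implementation enforces 
the check atomically in the namenode rather than with a separate exists() call.

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class NonRecursiveCreateSketch {
        // Fail if the parent directory is absent instead of silently creating it.
        static FSDataOutputStream createNoParents(FileSystem fs, Path file)
                throws IOException {
            if (!fs.exists(file.getParent())) {
                throw new FileNotFoundException("Parent does not exist: " + file.getParent());
            }
            return fs.create(file, false /* overwrite */);
        }
    }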

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-617) Support for non-recursive create() in HDFS

2010-06-29 Thread Nicolas Spiegelberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Spiegelberg updated HDFS-617:
-

Attachment: HDFS-617_20-append.patch

Backport of patch for 0.20-append branch

> Support for non-recursive create() in HDFS
> --
>
> Key: HDFS-617
> URL: https://issues.apache.org/jira/browse/HDFS-617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Kan Zhang
>Assignee: Kan Zhang
> Fix For: 0.21.0
>
> Attachments: h617-01.patch, h617-02.patch, h617-03.patch, 
> h617-04.patch, h617-06.patch, HDFS-617_20-append.patch
>
>
> HADOOP-4952 calls for a create call that doesn't automatically create missing 
> parent directories.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1274) ability to send replication traffic on a separate port to the Datanode

2010-06-29 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883613#action_12883613
 ] 

dhruba borthakur commented on HDFS-1274:


> Is this for TCP-level QoS?

Precisely.

> ability to send replication traffic on a separate port to the Datanode
> --
>
> Key: HDFS-1274
> URL: https://issues.apache.org/jira/browse/HDFS-1274
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: dhruba borthakur
>
> The datanode receives data from a client write request or from a replication 
> request. It is useful to configure the cluster so that dedicated bandwidth 
> is allocated for client writes and replication traffic. This requires that 
> client writes and replication traffic be configured to operate on two 
> different ports on the datanode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-29 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated HDFS-1057:
---

Status: Patch Available  (was: Open)

> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> ---
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node
>Affects Versions: 0.20-append, 0.21.0, 0.22.0
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Blocker
> Fix For: 0.20-append
>
> Attachments: conurrent-reader-patch-1.txt, 
> conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
> HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
> hdfs-1057-trunk-6.txt
>
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-29 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated HDFS-1057:
---

Status: Open  (was: Patch Available)

> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> ---
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node
>Affects Versions: 0.20-append, 0.21.0, 0.22.0
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Blocker
> Fix For: 0.20-append
>
> Attachments: conurrent-reader-patch-1.txt, 
> conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
> HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
> hdfs-1057-trunk-6.txt
>
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-29 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated HDFS-1057:
---

Attachment: hdfs-1057-trunk-6.txt

-fixed warnings
-fixed fd leak in some of the added tests

> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> ---
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node
>Affects Versions: 0.20-append, 0.21.0, 0.22.0
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Blocker
> Fix For: 0.20-append
>
> Attachments: conurrent-reader-patch-1.txt, 
> conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
> HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
> hdfs-1057-trunk-6.txt
>
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-06-29 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883602#action_12883602
 ] 

Sanjay Radia commented on HDFS-1111:


>I really think the correct design choice is to export basic APIs like 
>getCorruptFiles() as RPCs.
I suspect you have a misunderstanding of how the client side connects via RPC. 
We have no plans to expose the RPCs directly for now.
In order to allow tools to access such functionality it is not necessary to use 
the RPC directly; Hdfs and DistributedFileSystem (which extend 
AbstractFileSystem and FileSystem)  are effectively the client side library to 
access a NN. 
>On the other hand, if we do take getCorruptFiles() out of ClientProtocol, we 
>will make HDFS-1171 overly complicated or expensive.
Not if you add the method to Hdfs and DistributedFileSystem.
You simply need to make the case for adding getCorruptFiles to these two 
classes. It appears that this functionality got slipped in as part of HDFS-1171.
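
A sketch of the access pattern being described: tools obtain the HDFS-specific 
client class from FileSystem.get() and call its methods, rather than speaking 
ClientProtocol themselves. The getCorruptFiles() call is left commented out 
because it is exactly the hypothetical addition under discussion here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    class CorruptFilesToolSketch {
        static void run(Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(conf);   // client-side library entry point
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                System.out.println("Talking to " + dfs.getUri());
                // Hypothetical library method wrapping the ClientProtocol RPC:
                // for (String path : dfs.getCorruptFiles()) { System.out.println(path); }
            }
        }
    }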







> getCorruptFiles() should give some hint that the list is not complete
> -
>
> Key: HDFS-1111
> URL: https://issues.apache.org/jira/browse/HDFS-1111
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Rodrigo Schmidt
>Assignee: Rodrigo Schmidt
> Attachments: HADFS-1111.0.patch
>
>
> The list of corrupt files returned by the namenode doesn't indicate whether 
> the number of corrupted files is larger than the call output limit (which 
> would mean the list is not complete). There should be a way to hint at 
> incompleteness to clients.
> A simple hack would be to add an extra entry to the array returned with the 
> value null. Clients could interpret this as a sign that there are other 
> corrupt files in the system.
> We should also do some rephrasing of the fsck output to make it more 
> confident when the list is complete and less confident when the list is 
> known to be incomplete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1274) ability to send replication traffic on a separate port to the Datanode

2010-06-29 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883598#action_12883598
 ] 

Todd Lipcon commented on HDFS-1274:
---

Is this for TCP-level QoS? (e.g. using Linux traffic shaping to balance bandwidth 
allocations?)

> ability to send replication traffic on a separate port to the Datanode
> --
>
> Key: HDFS-1274
> URL: https://issues.apache.org/jira/browse/HDFS-1274
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: dhruba borthakur
>
> The datanode receives data from a client write request or from a replication 
> request. It is useful to configure the cluster so that dedicated bandwidth 
> is allocated for client writes and replication traffic. This requires that 
> client writes and replication traffic be configured to operate on two 
> different ports on the datanode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1274) ability to send replication traffic on a separate port to the Datanode

2010-06-29 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-1274:
---

Component/s: data-node

> ability to send replication traffic on a separate port to the Datanode
> --
>
> Key: HDFS-1274
> URL: https://issues.apache.org/jira/browse/HDFS-1274
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: dhruba borthakur
>
> The datanode receives data from a client write request or from a replication 
> request. It is useful to configure the cluster so that dedicated bandwidth 
> is allocated for client writes and replication traffic. This requires that 
> client writes and replication traffic be configured to operate on two 
> different ports on the datanode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1274) ability to send replication traffic on a separate port to the Datanode

2010-06-29 Thread dhruba borthakur (JIRA)
ability to send replication traffic on a separate port to the Datanode
--

 Key: HDFS-1274
 URL: https://issues.apache.org/jira/browse/HDFS-1274
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: dhruba borthakur


The datanode receives data from a client write request or from a replication 
request. It is useful to configure the cluster so that dedicated bandwidth 
is allocated for client writes and replication traffic. This requires that 
client writes and replication traffic be configured to operate on two different 
ports on the datanode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1273) Handle disk failure when writing new blocks on datanode

2010-06-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883534#action_12883534
 ] 

Hadoop QA commented on HDFS-1273:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448294/HDFS_1273.patch
  against trunk revision 957669.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/415/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/415/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/415/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/415/console

This message is automatically generated.

> Handle disk failure when writing new blocks on datanode
> ---
>
> Key: HDFS-1273
> URL: https://issues.apache.org/jira/browse/HDFS-1273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.21.0
>Reporter: Jeff Zhang
> Fix For: 0.21.0
>
> Attachments: HDFS_1273.patch
>
>
> This issue relates to HDFS-457; in the patch for HDFS-457, only disk failure 
> when reading is handled. This jira is to handle disk failure when writing 
> new blocks on the data node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-06-29 Thread Rodrigo Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883481#action_12883481
 ] 

Rodrigo Schmidt commented on HDFS-1111:
---

The RaidNode is not currently using this API, although its use was one of the 
motivations I had for adding getCorruptFiles() to ClientProtocol. Originally, 
raid was part of HDFS and I could certainly see how Raid (and possibly other 
parts of HDFS) could benefit from this as an RPC to the namenode. I thought the 
others saw it too because when I got to HDFS-729, having getCorruptFiles() on 
ClientProtocol was not under discussion anymore.

The JIRA that is responsible for making the RaidNode call getCorruptFiles is 
HDFS-1171. Most probably we will have to extend DistributedFileSystem to export 
getCorruptFiles(). That's why I said we don't have to be external to HDFS, but 
we can be external to the namenode.

On the other hand, if we do take getCorruptFiles() out of ClientProtocol, we 
will make HDFS-1171 overly complicated or expensive.

I really think the correct design choice is to export basic APIs like 
getCorruptFiles() as RPCs and build services like fsck and raid completely 
outside the namenode. After looking at the fsck code from the inside out and 
having experienced how it can sometimes compromise the whole filesystem because 
the namenode is using most of its resources to calculate outputs for fsck 
requests, I'm convinced it should be outside the namenode. For the sake of 
horizontal scalability of the namenode, we should be working on redesigning 
things like the current fsck implementation, instead of reinforcing it.

That's what I meant when I said we should be taking things out of the namenode. 
In my opinion, even if my case about having other parts of HDFS call 
getCorruptFiles() is not convincing, taking it out of ClientProtocol only 
reinforces the design choice of running fsck inside the namenode, which I think 
is bad. As we have more and more discussions about a distributed namenode, 
things like fsck should be the first ones running externally to it (to the 
namenode, not to HDFS). I see this as a low-hanging fruit towards a more 
scalable and distributed namenode.


> getCorruptFiles() should give some hint that the list is not complete
> -
>
> Key: HDFS-1111
> URL: https://issues.apache.org/jira/browse/HDFS-1111
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Rodrigo Schmidt
>Assignee: Rodrigo Schmidt
> Attachments: HADFS-1111.0.patch
>
>
> The list of corrupt files returned by the namenode doesn't indicate whether 
> the number of corrupted files is larger than the call output limit (which 
> would mean the list is not complete). There should be a way to hint at 
> incompleteness to clients.
> A simple hack would be to add an extra entry to the array returned with the 
> value null. Clients could interpret this as a sign that there are other 
> corrupt files in the system.
> We should also do some rephrasing of the fsck output to make it more 
> confident when the list is complete and less confident when the list is 
> known to be incomplete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-457) better handling of volume failure in Data Node storage

2010-06-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883466#action_12883466
 ] 

Jeff Zhang commented on HDFS-457:
-

Created JIRA HDFS-1273 for this issue and put the patch there.

> better handling of volume failure in Data Node storage
> --
>
> Key: HDFS-457
> URL: https://issues.apache.org/jira/browse/HDFS-457
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Boris Shkolnik
>Assignee: Boris Shkolnik
> Fix For: 0.21.0
>
> Attachments: HDFS-457-1.patch, HDFS-457-2.patch, HDFS-457-2.patch, 
> HDFS-457-2.patch, HDFS-457-3.patch, HDFS-457.patch, HDFS-457_20-append.patch, 
> HDFS_457.patch, jira.HDFS-457.branch-0.20-internal.patch, TestFsck.zip
>
>
> Current implementation shuts DataNode down completely when one of the 
> configured volumes of the storage fails.
> This is rather wasteful behavior because it  decreases utilization (good 
> storage becomes unavailable) and imposes extra load on the system 
> (replication of the blocks from the good volumes). These problems will become 
> even more prominent when we move to mixed (heterogeneous) clusters with many 
> more volumes per Data Node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1273) Handle disk failure when writing new blocks on datanode

2010-06-29 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated HDFS-1273:
-

Status: Patch Available  (was: Open)

> Handle disk failure when writing new blocks on datanode
> ---
>
> Key: HDFS-1273
> URL: https://issues.apache.org/jira/browse/HDFS-1273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.21.0
>Reporter: Jeff Zhang
> Fix For: 0.21.0
>
> Attachments: HDFS_1273.patch
>
>
> This issue relates to HDFS-457; in the patch for HDFS-457, only disk failure 
> when reading is handled. This jira is to handle disk failure when writing 
> new blocks on the data node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1273) Handle disk failure when writing new blocks on datanode

2010-06-29 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated HDFS-1273:
-

Attachment: HDFS_1273.patch

Attaching the patch.

> Handle disk failure when writing new blocks on datanode
> ---
>
> Key: HDFS-1273
> URL: https://issues.apache.org/jira/browse/HDFS-1273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.21.0
>Reporter: Jeff Zhang
> Fix For: 0.21.0
>
> Attachments: HDFS_1273.patch
>
>
> This issue relates to HDFS-457; in the patch for HDFS-457, only disk failure 
> when reading is handled. This jira is to handle disk failure when writing 
> new blocks on the data node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1273) Handle disk failure when writing new blocks on datanode

2010-06-29 Thread Jeff Zhang (JIRA)
Handle disk failure when writing new blocks on datanode
---

 Key: HDFS-1273
 URL: https://issues.apache.org/jira/browse/HDFS-1273
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.21.0
Reporter: Jeff Zhang
 Fix For: 0.21.0


This issue relates to HDFS-457; in the patch for HDFS-457, only disk failure 
when reading is handled. This jira is to handle disk failure when writing 
new blocks on the data node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-457) better handling of volume failure in Data Node storage

2010-06-29 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883456#action_12883456
 ] 

Eli Collins commented on HDFS-457:
--

Hey Jeff, 

Nice catch. Please file a new jira.

Thanks,
Eli

> better handling of volume failure in Data Node storage
> --
>
> Key: HDFS-457
> URL: https://issues.apache.org/jira/browse/HDFS-457
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Boris Shkolnik
>Assignee: Boris Shkolnik
> Fix For: 0.21.0
>
> Attachments: HDFS-457-1.patch, HDFS-457-2.patch, HDFS-457-2.patch, 
> HDFS-457-2.patch, HDFS-457-3.patch, HDFS-457.patch, HDFS-457_20-append.patch, 
> HDFS_457.patch, jira.HDFS-457.branch-0.20-internal.patch, TestFsck.zip
>
>
> Current implementation shuts DataNode down completely when one of the 
> configured volumes of the storage fails.
> This is rather wasteful behavior because it  decreases utilization (good 
> storage becomes unavailable) and imposes extra load on the system 
> (replication of the blocks from the good volumes). These problems will become 
> even more prominent when we move to mixed (heterogeneous) clusters with many 
> more volumes per Data Node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-457) better handling of volume failure in Data Node storage

2010-06-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883447#action_12883447
 ] 

Jeff Zhang commented on HDFS-457:
-

This is my first patch on HDFS; I'm not sure whether it is right to attach the 
patch here, or whether I need to create a new jira for this issue.

> better handling of volume failure in Data Node storage
> --
>
> Key: HDFS-457
> URL: https://issues.apache.org/jira/browse/HDFS-457
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Boris Shkolnik
>Assignee: Boris Shkolnik
> Fix For: 0.21.0
>
> Attachments: HDFS-457-1.patch, HDFS-457-2.patch, HDFS-457-2.patch, 
> HDFS-457-2.patch, HDFS-457-3.patch, HDFS-457.patch, HDFS-457_20-append.patch, 
> HDFS_457.patch, jira.HDFS-457.branch-0.20-internal.patch, TestFsck.zip
>
>
> Current implementation shuts DataNode down completely when one of the 
> configured volumes of the storage fails.
> This is rather wasteful behavior because it  decreases utilization (good 
> storage becomes unavailable) and imposes extra load on the system 
> (replication of the blocks from the good volumes). These problems will become 
> even more prominent when we move to mixed (heterogeneous) clusters with many 
> more volumes per Data Node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-457) better handling of volume failure in Data Node storage

2010-06-29 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated HDFS-457:


Attachment: HDFS_457.patch

Attaching the patch. It does checkDiskError in BlockReceiver before allocating a 
volume for the new block, so after checkDiskError it is guaranteed that the 
remaining volumes are all normal and the failed volumes have been removed.
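
A rough, self-contained sketch of that ordering. The names below are stand-ins 
for the real BlockReceiver/FSDataset types; only the "check disks, then pick a 
volume" sequence is taken from the comment above.

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    class WriteVolumeSelectionSketch {
        private final List<File> volumes = new ArrayList<File>(); // stand-in for the FSVolume set

        void receiveBlock() throws IOException {
            checkDiskError();             // drop failed volumes before choosing one
            File volume = chooseVolume(); // only healthy volumes remain at this point
            // ... write the block's data and metadata files under 'volume' ...
        }

        private void checkDiskError() {
            for (Iterator<File> it = volumes.iterator(); it.hasNext();) {
                if (!it.next().canWrite()) { // placeholder health check
                    it.remove();
                }
            }
        }

        private File chooseVolume() throws IOException {
            if (volumes.isEmpty()) {
                throw new IOException("No healthy volumes left for the new block");
            }
            return volumes.get(0); // the real code uses round-robin placement
        }
    }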



> better handling of volume failure in Data Node storage
> --
>
> Key: HDFS-457
> URL: https://issues.apache.org/jira/browse/HDFS-457
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Reporter: Boris Shkolnik
>Assignee: Boris Shkolnik
> Fix For: 0.21.0
>
> Attachments: HDFS-457-1.patch, HDFS-457-2.patch, HDFS-457-2.patch, 
> HDFS-457-2.patch, HDFS-457-3.patch, HDFS-457.patch, HDFS-457_20-append.patch, 
> HDFS_457.patch, jira.HDFS-457.branch-0.20-internal.patch, TestFsck.zip
>
>
> Current implementation shuts DataNode down completely when one of the 
> configured volumes of the storage fails.
> This is rather wasteful behavior because it  decreases utilization (good 
> storage becomes unavailable) and imposes extra load on the system 
> (replication of the blocks from the good volumes). These problems will become 
> even more prominent when we move to mixed (heterogeneous) clusters with many 
> more volumes per Data Node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1258) Clearing namespace quota on "/" corrupts FS image

2010-06-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883441#action_12883441
 ] 

Hadoop QA commented on HDFS-1258:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448268/clear-quota.patch
  against trunk revision 957669.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/208/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/208/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/208/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/208/console

This message is automatically generated.

> Clearing namespace quota on "/" corrupts FS image
> -
>
> Key: HDFS-1258
> URL: https://issues.apache.org/jira/browse/HDFS-1258
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Blocker
> Fix For: 0.20.3, 0.21.0, 0.22.0
>
> Attachments: clear-quota.patch, clear-quota.patch
>
>
> The HDFS root directory starts out with a default namespace quota of 
> Integer.MAX_VALUE. If you clear this quota (using "hadoop dfsadmin -clrQuota 
> /"), the fsimage gets corrupted immediately. Subsequent 2NN rolls will fail, 
> and the NN will not come back up from a restart.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.