[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067108#comment-13067108 ]

Eli Collins commented on HDFS-2011:

Ah, right, abort() is new in HDFS-1073. I think the tests you have for trunk are fine for now. The new test with abort() will come in when HDFS-1073 is merged.

Removal and restoration of storage directories on checkpointing failure doesn't work properly

Key: HDFS-2011
URL: https://issues.apache.org/jira/browse/HDFS-2011
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Fix For: 0.23.0
Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073-2.txt, elfos-close-patch-on-1073-3.txt, elfos-close-patch-on-1073.txt

Removal and restoration of storage directories on checkpointing failure doesn't work properly. Sometimes it throws a NullPointerException and sometimes it doesn't take off a failed storage directory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066603#comment-13066603 ]

Ravi Prakash commented on HDFS-2011:

@Eli - The new patch tests abort(), but I couldn't find the method in EditLogFileOutputStream.java in trunk. It's available in HDFS-1073. Could you please point me to what exactly I should incorporate into trunk?
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065464#comment-13065464 ]

Todd Lipcon commented on HDFS-2011:

Committed elfos-close-patch-on-1073-3.txt to the HDFS-1073 branch to fix the test case. Thanks Eli.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065667#comment-13065667 ]

Hudson commented on HDFS-2011:

Integrated in Hadoop-Hdfs-1073-branch #9 (See [https://builds.apache.org/job/Hadoop-Hdfs-1073-branch/9/])
Amend HDFS-2011 for HDFS-1073 branch. Update test cases for new behavior of EditLogFileOutputStream. Contributed by Todd Lipcon and Eli Collins.

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146848
Files :
* /hadoop/common/branches/HDFS-1073/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* /hadoop/common/branches/HDFS-1073/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064880#comment-13064880 ]

John George commented on HDFS-2011:

+1 Looks good to me. Thanks Eli.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061273#comment-13061273 ]

John George commented on HDFS-2011:

The patch looks good. Shouldn't the sequence close(), close() be tested as well, since that case could plausibly occur?
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060745#comment-13060745 ]

Todd Lipcon commented on HDFS-2011:

I'm working on merging this with HDFS-1073, and had one question: when do we expect that an editlog stream would be closed twice? In HDFS-1073 there are some extra asserts, so instead of ignoring the second close, it now throws "java.io.IOException: Trying to use aborted output stream". I'm debating whether to remove this exception, as you've done in this patch, or to remove the patch, since closing a stream twice seems like it might be indicative of a bug.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060775#comment-13060775 ]

Ravi Prakash commented on HDFS-2011:

I had noticed close() being called twice while testing this functionality. This was causing a NullPointerException the second time. The stack trace is given in comment https://issues.apache.org/jira/browse/HDFS-2011?focusedCommentId=13041858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041858

{quote}
2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
  at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
  at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
{quote}

The bug itself is quite hard to reproduce. I had to run my tests in an infinite loop, and the NullPointerException happened after 3-4 hours (each run of the test would take maybe 2 minutes). After the NullPointerException, the namenode would essentially be useless. Even hdfs dfs -ls would throw a NullPointerException.

I am not sure myself which philosophy would be better. FileOutputStream itself ignores a second close. I checked this with the following program:

{noformat}
import java.io.*;

public class TestJAVA {
  public static void main(String args[]) {
    System.out.println("Hello World");
    try {
      FileOutputStream fos = new FileOutputStream("/tmp/ravi.txt");
      fos.write(50);
      fos.write(50);
      fos.write(50);
      fos.write(50);
      fos.write(50);
      fos.write(50);
      fos.close();
      fos.close(); // second close on the already-closed stream
    } catch (IOException ioe) {
      System.out.println("Hello California");
      System.out.println(ioe);
    }
    System.out.println("Hello Champaign");
  }
}
{noformat}
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060777#comment-13060777 ]

Ravi Prakash commented on HDFS-2011:

The program above printed:

{noformat}
Hello World
Hello Champaign
{noformat}
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060780#comment-13060780 ]

John George commented on HDFS-2011:

If I remember right, it was a case of an incomplete create as opposed to close being called twice. So, the close() was being called on a stream that was not really created...
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060812#comment-13060812 ]

Todd Lipcon commented on HDFS-2011:

In the HDFS-1073 branch, EditLogOutputStream now has separate close() and abort() methods. abort() is used when there has been some error on the stream and we expect to do an unclean close (i.e., without flushing). close() is used for clean closes. If close() itself fails, it will then proceed to abort() when the IO error is handled.

So, I think the correct test case on the branch is to call abort() twice and make sure that's ignored, or call close() and then abort() to make sure that's ignored. Does that sound reasonable?
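To make the close()/abort() split described above concrete, here is a minimal sketch. It is not the HDFS-1073 code: SketchEditLogStream, its in-memory buffer, and the choice to treat a repeated close() as a no-op are all assumptions made for illustration. It only shows one way a stream can flush on a clean close, skip the flush on abort, and ignore calls made after it has already been shut down.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical stand-in for an edit log stream; not the Hadoop class. */
public class SketchEditLogStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private OutputStream sink; // null once the stream has been closed or aborted

  public SketchEditLogStream(OutputStream sink) {
    this.sink = sink;
  }

  public void write(int b) throws IOException {
    if (sink == null) {
      throw new IOException("Trying to use closed or aborted output stream");
    }
    buffer.write(b);
  }

  /** Clean close: flush buffered bytes to the sink, then release it. */
  public void close() throws IOException {
    if (sink == null) {
      return; // assumption: a repeated close() is ignored
    }
    buffer.writeTo(sink);
    sink.flush();
    sink.close();
    sink = null; // only reached on success, so a failed close() can still be aborted
  }

  /** Unclean close: release the sink and drop buffered bytes without flushing. */
  public void abort() {
    if (sink == null) {
      return; // abort() after close() or after another abort() is ignored
    }
    try {
      sink.close();
    } catch (IOException ignored) {
      // the stream is already being torn down; nothing more to do
    }
    sink = null;
    buffer.reset();
  }

  public boolean isOpen() {
    return sink != null;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream cleanSink = new ByteArrayOutputStream();
    SketchEditLogStream clean = new SketchEditLogStream(cleanSink);
    clean.write(50);
    clean.close();
    clean.abort(); // close() followed by abort(): ignored
    System.out.println("bytes after clean close: " + cleanSink.size());

    ByteArrayOutputStream abortedSink = new ByteArrayOutputStream();
    SketchEditLogStream aborted = new SketchEditLogStream(abortedSink);
    aborted.write(50);
    aborted.abort();
    aborted.abort(); // abort() twice: ignored
    System.out.println("bytes after abort: " + abortedSink.size());
  }
}
```

Leaving the sink set when close() fails is what lets the caller's error handling call abort() afterwards, which is the ordering described in the comment above.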
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060853#comment-13060853 ]

John George commented on HDFS-2011:

I think calling
1. abort() twice
2. close() twice
3. close() followed by an abort()
would test most cases.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058532#comment-13058532 ]

Hudson commented on HDFS-2011:

Integrated in Hadoop-Hdfs-trunk #712 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/712/])
HDFS-2011. Removal and restoration of storage directories on checkpointing failure doesn't work properly. Contributed by Ravi Prakash.

mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1141748
Files :
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
* /hadoop/common/trunk/hdfs/CHANGES.txt
* /hadoop/common/trunk/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058605#comment-13058605 ]

Ravi Prakash commented on HDFS-2011:

Thanks Matt, Todd and Cos! My first patch into Hadoop. Yaay!!!
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058136#comment-13058136 ]

Hudson commented on HDFS-2011:

Integrated in Hadoop-Hdfs-trunk-Commit #771 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/771/])
HDFS-2011. Removal and restoration of storage directories on checkpointing failure doesn't work properly. Contributed by Ravi Prakash.

mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1141748
Files :
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
* /hadoop/common/trunk/hdfs/CHANGES.txt
* /hadoop/common/trunk/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056538#comment-13056538 ]

Hadoop QA commented on HDFS-2011:

+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12484433/HDFS-2011.8.patch against trunk revision 1140030.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/859//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/859//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/859//console

This message is automatically generated.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055660#comment-13055660 ]

Ravi Prakash commented on HDFS-2011:

Thanks Matt, I've incorporated all your comments :)

{quote}
9. In assertTrue("List of storage directories didn't have storageDirToCheck"), did you intend to iterate over all elements in the list? You go to the trouble of creating an iterator, and then only use the first element.
{quote}

I meant to get the first element of the Collection (since that's what nnStorage.getEditsDirectories() returns).
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055663#comment-13055663 ]

Ravi Prakash commented on HDFS-2011:

{quote}
Oh, and in that last bit of code, if fc is still open I would think it should be closed after the truncate. But it isn't in the current code. Can you see a reason why?
{quote}

I don't know how FileChannels work, but in the constructor you can see that fc and fp are both derived from the same RandomAccessFile (rp). Could calling fp.close() automatically close fc too?

{quote}
Ravi, I don't think this collides with HDFS-988, but please check.
{quote}

The diffstats didn't have any files in common.

{noformat}
$ diffstat hdfs-988-7.patch
java/org/apache/hadoop/hdfs/DFSOutputStream.java | 5
java/org/apache/hadoop/hdfs/server/namenode/BlockManager.java | 5
java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java | 804 ++--
java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java | 10
java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java | 1891 +-
java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java | 1
test/hdfs/org/apache/hadoop/cli/TestHDFSCLI.java | 2
test/hdfs/org/apache/hadoop/hdfs/DFSTestUtil.java | 9
test/hdfs/org/apache/hadoop/hdfs/TestDecommission.java | 6
test/hdfs/org/apache/hadoop/hdfs/TestSafeMode.java | 208 -
test/hdfs/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java | 15
test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestDeadDatanode.java | 9
test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestHeartbeatHandling.java | 8
test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestNNThroughputBenchmark.java | 1
test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestSafeMode.java | 90
test/unit/org/apache/hadoop/hdfs/server/namenode/TestNNLeaseRecovery.java | 39
16 files changed, 1676 insertions(+), 1427 deletions(-)

$ diffstat HDFS-2011.6.patch
java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java | 31 +++-
java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java | 6
test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java | 67 ++
3 files changed, 94 insertions(+), 10 deletions(-)
{noformat}

test-patch passed, and I also ran my automated test twice just to be sure. The functionality doesn't seem to have collided.
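One part of the FileChannel question can be checked in isolation. The java.io.RandomAccessFile documentation says that closing the file also closes its associated channel, and the small sketch below observes that. The class and method names here (ChannelCloseCheck, channelOpenAfterFileClose) are made up for this example, and it deliberately does not cover the case asked about above, where a separate stream (fp) built from the same file is closed instead of the RandomAccessFile itself.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

/** Hypothetical check; not part of any HDFS-2011 patch. */
public class ChannelCloseCheck {

  /** Returns whether the channel is still open after its RandomAccessFile is closed. */
  static boolean channelOpenAfterFileClose() throws IOException {
    File tmp = File.createTempFile("channel-check", ".bin");
    tmp.deleteOnExit();

    RandomAccessFile raf = new RandomAccessFile(tmp, "rw");
    FileChannel fc = raf.getChannel(); // channel associated with raf
    raf.write(50);

    raf.close(); // documented to close the associated channel as well
    boolean stillOpen = fc.isOpen();

    fc.close(); // closing an already-closed channel has no effect
    return stillOpen;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("channel open after RandomAccessFile.close(): "
        + channelOpenAfterFileClose());
  }
}
```

Because the sketch closes the RandomAccessFile directly, it says nothing definite about whether fp.close() alone is enough; an explicit fc.close() avoids depending on that.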
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055689#comment-13055689 ]

Hadoop QA commented on HDFS-2011:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483975/HDFS-2011.6.patch against trunk revision 1140030.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
  org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
  org.apache.hadoop.hdfs.TestFileAppend2
+1 contrib tests. The patch passed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/848//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/848//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/848//console

This message is automatically generated.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055745#comment-13055745 ]

Ravi Prakash commented on HDFS-2011:

The test failed because fc.close() wasn't being called. Thanks Matt! :) I've included that in the latest patch. The two previously failing tests now pass, and so do test-patch and my automated unit tests.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055799#comment-13055799 ]

Hadoop QA commented on HDFS-2011:

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12484007/HDFS-2011.7.patch
  against trunk revision 1140030.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/849//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/849//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/849//console

This message is automatically generated.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056216#comment-13056216 ]

Matt Foley commented on HDFS-2011:

Looking great! And you're right about needing the iterator to access nnStorage.getEditsDirectories(), since it's just a Collection.

There's just one small thing I'd like to fix:
* In testSetCheckpointTimeInStorageHandlesIOException, in the new File ctor, you shouldn't need slashes around "/storageDirToCheck/"; "storageDirToCheck" should work.
* And one trivial edit: the comment in EditLogFileOutputStream.close(), "// if already closed, just return", isn't correct any more. "// if already closed, just skip" would be correct.

If that's okay, I'll commit it on the next round.
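The remark about needing an iterator can be illustrated with a small, self-contained sketch. It assumes only what the comment states, namely that the edits directories come back as a plain Collection (which has no get(int)); the class CollectionAccessSketch, its methods, and the sample URIs are hypothetical and are not taken from the patch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration; not code from HDFS-2011.
final class CollectionAccessSketch {

  /** Returns the first element. A Collection has no get(int), so an iterator is needed. */
  static <T> T first(Collection<T> items) {
    Iterator<T> it = items.iterator();
    if (!it.hasNext()) {
      throw new NoSuchElementException("empty collection");
    }
    return it.next();
  }

  /** Checks every element instead of only the first one. */
  static boolean containsPathEndingWith(Collection<URI> dirs, String name) {
    for (URI dir : dirs) {
      if (dir.getPath().endsWith(name)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Collection<URI> editsDirs = new ArrayList<>();
    editsDirs.add(URI.create("file:///tmp/other"));
    editsDirs.add(URI.create("file:///tmp/storageDirToCheck"));
    System.out.println(first(editsDirs));
    System.out.println(containsPathEndingWith(editsDirs, "storageDirToCheck"));
  }
}
```

Iterating over all elements, as in containsPathEndingWith, avoids depending on the order in which the directories happen to be returned.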
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049387#comment-13049387 ]

Matt Foley commented on HDFS-2011:

Ravi, I don't think this collides with HDFS-988, but please check.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045163#comment-13045163 ]

Matt Foley commented on HDFS-2011:

Hi Ravi, looking a lot better. Here are a few more.

TestCheckpoint.testEditLogFileOutputStreamCloses():

1. Use the two-argument form of the File ctor:
   File(System.getProperty("test.build.data", "/tmp"), "editLogStream.dat")
   not
   File(System.getProperty("test.build.data", "/tmp") + "editLogStream.dat")
   This will ensure the path delimiter is inserted correctly.
2. In the finally clause: again, there's no point in catch-then-assert, unless you need to do something in between. You can just let it fail. The point of the try/catch I recommended was so that it WOULDN'T fail, because failing could prevent any prior exception info from propagating. So the catch clause should use println to log the problem, but not fail or otherwise cause an assert.
3. If you want to fine-tune that a little, you could have a variable "success" which is set to false at the beginning, and set to true at the end of the main body (before the finally clause). Then in this catch clause you could throw if success==true, but just println if !success.
4. If you need to use println or Assert to message an exception, you can use StringUtils.stringifyException(e), which prints the whole stack trace, vs e.toString(), which only prints one line of info. But LOG and throw messages allow using Exception objects as additional arguments, giving the same result as StringUtils.stringifyException() but with cleaner syntax.

testSetCheckpointTimeInStorageHandlesIOException():

5. Use
   File(System.getProperty("test.build.data", "/tmp"), "storageDirToCheck")
   instead of
   File (System.getProperty("test.build.data", "/tmp") + "/storageDirToCheck/")
6. And don't put a space between "File" and the following parenthesis. A ctor is a method call.
7. You probably want to use mkdirs() rather than mkdir().
8. Extra blanks before and after argument lists are against the coding style standard.
9. In assertTrue("List of storage directories didn't have storageDirToCheck"), did you intend to iterate over all elements in the list? You go to the trouble of creating an iterator, and then only use the first element.
10. Doesn't this routine also need a try/catch context that cleans up the created directory if an error occurs?

EditLogFileOutputStream.close():

11. Interesting problem. It looks like fc and fp are not directly dependent on bufCurrent and bufReady, but simply are likely to be null if bufCurrent and bufReady end up null, therefore I think we should still treat bufCurrent and bufReady as possibly valid/invalid separately. Can you tell if the problem value of fc is null or simply a file descriptor that is already closed? What if you use the previously suggested statements for bufCurrent and bufReady, followed by:
{code}
// remove the last INVALID marker from transaction log.
if (fc != null && fc.isOpen()) {
  fc.truncate(fc.position());
}
if (fp != null) {
  fp.close();
}
{code}
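Items 1 to 3 of the review above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the test code from the patch: CleanupSketch, Body, scratchFile and runWithCleanup are hypothetical names, and File.delete() reports failure through its return value, so the sketch checks that value rather than catching an exception.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical illustration of the review suggestions; not code from HDFS-2011.
final class CleanupSketch {

  /** Hypothetical callback standing in for the body of a test. */
  interface Body {
    void run(File scratch) throws IOException;
  }

  /** Two-argument File ctor: the path separator is inserted for us. */
  static File scratchFile(String name) {
    return new File(System.getProperty("test.build.data", "/tmp"), name);
  }

  /**
   * Runs the body and cleans up afterwards. A cleanup failure is only
   * escalated when the body succeeded; otherwise it is just printed, so
   * the original exception is the one that propagates.
   */
  static void runWithCleanup(File scratch, Body body) throws IOException {
    boolean success = false;
    try {
      body.run(scratch);
      success = true;
    } finally {
      if (scratch.exists() && !scratch.delete()) {
        String msg = "Couldn't delete " + scratch.getAbsolutePath();
        if (success) {
          throw new IOException(msg);
        }
        System.err.println(msg);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    File scratch = new File(System.getProperty("java.io.tmpdir"), "editLogStream.dat");
    runWithCleanup(scratch, f -> System.out.println("Would test with " + f));
  }
}
```

Concatenating the two strings instead (as in "/tmp" + "editLogStream.dat") would yield "/tmpeditLogStream.dat", which is the problem item 1 points at.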
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045168#comment-13045168 ]

Matt Foley commented on HDFS-2011:

Oh, and in that last bit of code, if fc is still open I would think it should be closed after the truncate. But it isn't in the current code. Can you see a reason why?
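The guarded truncate-then-close discussed in the two comments above can be tried out with standard Java I/O. The following is a sketch only: GuardedCloseSketch is a hypothetical stand-in, not EditLogFileOutputStream, and it borrows just the field names fp and fc from the thread. The 8-byte write and the position of 4 are arbitrary demo values.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;

// Hypothetical stand-in; only the field names fp and fc come from the thread.
final class GuardedCloseSketch {
  private RandomAccessFile fp;
  private FileChannel fc;

  GuardedCloseSketch(File file) throws IOException {
    fp = new RandomAccessFile(file, "rw");
    fc = fp.getChannel();
  }

  /** Truncates at the current position, then closes; safe to call twice. */
  void close() throws IOException {
    try {
      if (fc != null && fc.isOpen()) {
        fc.truncate(fc.position()); // drop anything after the current position
        fc.close();
      }
      if (fp != null) {
        fp.close();
      }
    } finally {
      fc = null;
      fp = null;
    }
  }

  /** Writes 8 bytes, rewinds to offset 4, closes twice, returns the final length. */
  static long demo() {
    try {
      File f = File.createTempFile("editlog", ".dat");
      try {
        GuardedCloseSketch s = new GuardedCloseSketch(f);
        s.fp.write(new byte[8]);
        s.fc.position(4);
        s.close();
        s.close(); // already closed, so this is skipped
        return f.length();
      } finally {
        f.delete();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("length after close: " + demo());
  }
}
```

Checking isOpen() before calling position() is what keeps a second close() from touching a channel that has already been closed.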
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043370#comment-13043370 ]

Ravi Prakash commented on HDFS-2011:

Hi Matt,
Thanks a ton for your review! I learned a lot from your detailed explanations. :) I followed all of your suggestions. A couple of things to note:

1. To be able to throw exceptions like you suggested, I had to make my two functions individual JUnit tests. I hope that is fine. (Earlier they were being called from testCheckpoint() throws IOException.)
2. Thanks for the tip to use toURI. :) However, when I used new Path(System.getProperty("test.build.data","/tmp"), "storageDirToCheck").toUri(), the test failed saying
{noformat}
Testcase: testSetCheckpointTimeInStorageHandlesIOException took 0.077 sec
	Caused an ERROR
Undefined scheme for /home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck
java.io.IOException: Undefined scheme for /home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck
	at org.apache.hadoop.hdfs.server.namenode.NNStorage.checkSchemeConsistency(NNStorage.java:348)
	at org.apache.hadoop.hdfs.server.namenode.NNStorage.setStorageDirectories(NNStorage.java:306)
	at org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testSetCheckpointTimeInStorageHandlesIOException(TestCheckpoint.java:179)
{noformat}
So I changed it to use new File(...).toURI(). I hope that is fine too.
3. In the comment, I meant to convey that the block of code was for when writeCheckpointTime incurred an IOException. I've removed the comment, seeing that it had already been covered by the comment above it. Sorry for the ambiguity.
4. When I separated the bufCurrent and bufReady cases, the test failed saying
{noformat}
Testcase: testEditLogFileOutputStreamCloses took 0.042 sec
	Caused an ERROR
Bad file descriptor
java.io.IOException: Bad file descriptor
	at sun.nio.ch.FileChannelImpl.position0(Native Method)
	at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:284)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:141)
	at org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testEditLogFileOutputStreamCloses(TestCheckpoint.java:154)
{noformat}
This was because these lines (more specifically the first) were still being called.
{noformat}
// remove the last INVALID marker from transaction log.
fc.truncate(fc.position());
fp.close();
{noformat}
I've let it remain the same.
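The "Undefined scheme" failure in point 2 comes down to whether the URI carries a scheme. The sketch below shows the difference using only java.net.URI and java.io.File. It does not use Hadoop's Path: the scheme-less URI is built by hand as an assumed stand-in for what the stack trace suggests the Path-based call produced. UriSchemeSketch and its method names are hypothetical.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration; Hadoop's Path is deliberately not used here.
final class UriSchemeSketch {

  /** Builds a URI from a bare path; it has no scheme component. */
  static URI schemeless(String absolutePath) {
    try {
      return new URI(null, null, absolutePath, null);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** File.toURI() always yields a URI with the "file" scheme. */
  static URI viaFile(String parent, String child) {
    return new File(parent, child).toURI();
  }

  public static void main(String[] args) {
    URI bare = schemeless("/tmp/storageDirToCheck");
    URI file = viaFile("/tmp", "storageDirToCheck");
    System.out.println(bare + " scheme=" + bare.getScheme());
    System.out.println(file + " scheme=" + file.getScheme());
  }
}
```

A consistency check that requires a scheme would accept the second URI and reject the first, which matches the behaviour reported in the message above.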
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043414#comment-13043414 ]

Hadoop QA commented on HDFS-2011:

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12481349/HDFS-2011.4.patch
  against trunk revision 1130870.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 4 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//console

This message is automatically generated.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043892#comment-13043892 ]

Todd Lipcon commented on HDFS-2011:

A few style nits on HDFS-2011.4.patch:
- Please try to keep lines under 80 columns wide where possible. If it spills over to 85 or 90 here and there, not a huge deal, but 100+ columns should be avoided.
- Why catch SecurityException? That looks very much out of place, and given it's an unchecked exception, you don't need to catch it at all.
- The assertTrue around mkdir() in testSetCheckpointTimeInStorageHandlesIOException should probably check exists() || mkdir(). Or call deleteFully on it at the top of the test.
- You construct that same file path several times in the same test. Please just make it once as a constant.
- In the error messages, better to do something like: "Couldn't remove directory " + TEST_STORAGE_DIR.getAbsoluteFile(). That way the developer can easily track down the full path.
- Alignment is off in the NNStorage.java change.
- The comment referring to "edit" and "edits.new" in ELFOS is out of place; that class shouldn't know about details of how it's used. Instead it should read something like "// if already closed, just return".
- TestCheckpoint inherits from TestCase, so you don't need to import org.junit.Assert.
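Three of the nits above (build the path once as a constant, tolerate an already existing directory, and put the absolute path in failure messages) fit in a short sketch. It is an illustration under assumptions, not the revised test: TestDirSketch and its helper names are hypothetical, and it uses mkdirs() rather than mkdir(), following the separate suggestion made elsewhere in this thread.

```java
import java.io.File;

// Hypothetical illustration of the style nits; not code from HDFS-2011.
final class TestDirSketch {

  /** Built once and reused, instead of re-assembling the path in several places. */
  static final File TEST_STORAGE_DIR =
      new File(System.getProperty("test.build.data", "/tmp"), "storageDirToCheck");

  /** True if the directory is already there or could be created. */
  static boolean ensureDir(File dir) {
    return dir.exists() || dir.mkdirs();
  }

  /** Failure message carrying the absolute path, so it is easy to track down. */
  static String removeFailureMessage(File dir) {
    return "Couldn't remove directory " + dir.getAbsoluteFile();
  }

  public static void main(String[] args) {
    File dir = new File(System.getProperty("java.io.tmpdir"), "storageDirToCheck");
    System.out.println("ensured: " + ensureDir(dir));
    if (!dir.delete()) {
      System.err.println(removeFailureMessage(dir));
    }
  }
}
```

Because ensureDir() accepts a pre-existing directory, a leftover from an earlier failed run does not make the next run fail at setup.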
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043990#comment-13043990 ]

Ravi Prakash commented on HDFS-2011:

Hi Todd,
Thanks a lot for reviewing the patch. :) I continue to learn :) I have followed all of your suggestions. The only note is that I am checking for SecurityException so that in the finally block it doesn't mask an IOException / NullPointerException that was possibly thrown in the try block. I hope that is fine.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044011#comment-13044011 ]

Todd Lipcon commented on HDFS-2011:

Re SecurityException: I still don't see any reason that delete() would throw such an exception. AFAIK that only happens if a security manager is installed, which we never expect in unit tests.

Will try to look over the new patch revision later today or early next week. Thanks!
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044025#comment-13044025 ]

Hadoop QA commented on HDFS-2011:

+1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12481396/HDFS-2011.5.patch
  against trunk revision 1131124.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.
    +1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//console

This message is automatically generated.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043144#comment-13043144 ]

Matt Foley commented on HDFS-2011:

Hi Ravi, the logic of your changes is fine. The following comments are almost all regarding common usages in the Hadoop code base and unit tests.

TestCheckpoint.checkEditLogFileOutputStreamCloses():
* To obtain the build/test/data directory correctly, use System.getProperty("test.build.data","/tmp") rather than hardcoding it; then create your desired file relative to that directory.
* I find "elfosFile" to be a very opaque name. Would it be reasonable to use something like "editLogStream" instead?
* Instead of Assert.assertTrue(msg,false), use Assert.fail(msg).
* But, there is no need to catch and message exceptions that shouldn't happen. Both catch{} clauses add no significant value compared to the stack trace that will be printed on exception, by junit. In fact, the stack trace from the catch-and-Assert is LESS informative than the original exception stack trace would have been, because it points into the catch clause instead of into where the exception actually occurred.
* It's good that you bracket both the beginning and end with printlns that clearly state what is being tested; if an exception occurs the developer will immediately see what went wrong (with the help of the stack trace).
* Within the finally{} clause, it might be a good idea to put the delete() call in its own try/catch context. If another exception happened, you wouldn't want to interfere with the original exception message, which carries the info you created the testcase to expose.

checkSetCheckpointTimeInStorageHandlesIOException():
* "alFS" and "alES" are also very opaque names. Hadoop doesn't subscribe to Hungarian naming, so the "al" prefix isn't needed. "FS" usually means FileSystem, which isn't the same as a StorageDirectory. So consider renaming these, perhaps to "fsImageDirs" and "editsDirs".
* First try/catch context: Again, there's no need to catch-and-Assert unexpected failures. If they occur, they will be duly reported by junit.
* As before, the place to put the directories you create should be relative to System.getProperty("test.build.data","/tmp").
* And to create the URIs, it is probably best to do something equivalent to new Path(System.getProperty("test.build.data","/tmp"), "storageDirToCheck").toUri(). This will work around any filesystem path naming oddities.
* In the assert, use listRsd.get(listRsd.size()-1) instead of listRsd.get(0), because the new element would be added to the end of the list -- I think :-)
* It might be good to use nnStorage.getEditsDirectories() and/or nnStorage.getImageDirectories() before deleting the dir, to assure that the setStorageDirectories() had the expected result, and call nnStorage.getRemovedStorageDirs() before to assure that the list initially does not contain storageDirToCheck.

NNStorage.setCheckpointTimeInStorage():
* In the comment "//Since writeCheckpointTime may also encounter an IOException in case underlying storage fails", substitute "reportErrorsOnDirectories()" for "writeCheckpointTime".
* There is a singular reportErrorsOnDirectory() method. Could you use it instead of reportErrorsOnDirectories()? Then you wouldn't need to construct the ArrayList.
* In the LOG.error if the second IOE happens, suggest LOG.error("Failed to report and remove NN storage directory " + sd.getRoot().getPath(), ioe); Besides clarifying the msg, note that "+ ioe" uses ioe.toString(), which only prints a single line about the exception, while using ", ioe" in a LOG argument list causes the entire stack trace to be printed.

EditLogFileOutputStream.close():
* Suggest you separate the bufCurrent and bufReady cases. Do:
{code}
if (bufCurrent != null) {
  int bufSize = bufCurrent.size();
  if (bufSize != 0) {
    throw new IOException("FSEditStream has " + bufSize
        + " bytes still to be flushed and cannot " + "be closed.");
  }
  bufCurrent.close();
}
if (bufReady != null) {
  bufReady.close();
}
{code}
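The difference between appending an exception to a message and passing it along with its stack trace can be seen without any logging framework. The sketch below is a stand-in built on the JDK only: ExceptionTextSketch and its methods are hypothetical names, and the commons-logging LOG object and Hadoop's StringUtils from the review are intentionally not used.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical illustration using only the JDK; no logging framework involved.
final class ExceptionTextSketch {

  /** String concatenation calls t.toString(): class name and message only. */
  static String oneLine(String msg, Throwable t) {
    return msg + t;
  }

  /** Renders the full stack trace, as printStackTrace() would print it. */
  static String withStackTrace(Throwable t) {
    StringWriter sw = new StringWriter();
    PrintWriter pw = new PrintWriter(sw);
    t.printStackTrace(pw);
    pw.flush();
    return sw.toString();
  }

  public static void main(String[] args) {
    IOException ioe = new IOException("disk failed");
    System.out.println(oneLine("Failed to report and remove NN storage directory: ", ioe));
    System.out.println(withStackTrace(ioe));
  }
}
```

The one-line form loses the location where the exception was raised, which is usually the most useful part when diagnosing a storage directory failure.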
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043220#comment-13043220 ]

Konstantin Boudnik commented on HDFS-2011:

There's also this error message
{{+LOG.error("Problem erroring streams " + ioe);}}
which is somewhat moot.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042285#comment-13042285 ]

Ravi Prakash commented on HDFS-2011:

I ran test-patch. Also ran ant-test and no new test failures have been introduced. Can someone please review / commit the patch?
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042356#comment-13042356 ] Hadoop QA commented on HDFS-2011: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12481110/HDFS-2011.patch against trunk revision 1129942. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestDFSUpgradeFromImage org.apache.hadoop.hdfs.TestHFlush +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//console This message is automatically generated. Removal and restoration of storage directories on checkpointing failure doesn't work properly - Key: HDFS-2011 URL: https://issues.apache.org/jira/browse/HDFS-2011 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch Removal and restoration of storage directories on checkpointing failure doesn't work properly. 
Sometimes it throws a NullPointerException and sometimes it doesn't take off a failed storage directory.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042365#comment-13042365 ]

Hadoop QA commented on HDFS-2011:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12481119/HDFS-2011.patch
against trunk revision 1129942.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 8 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
    org.apache.hadoop.cli.TestHDFSCLI
+1 contrib tests. The patch passed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//console

This message is automatically generated.

Removal and restoration of storage directories on checkpointing failure doesn't work properly

Key: HDFS-2011
URL: https://issues.apache.org/jira/browse/HDFS-2011
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Attachments: HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch

Removal and restoration of storage directories on checkpointing failure doesn't work properly.
Sometimes it throws a NullPointerException and sometimes it doesn't take off a failed storage directory.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042402#comment-13042402 ]

Hadoop QA commented on HDFS-2011:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12481129/HDFS-2011.3.patch
against trunk revision 1130262.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 8 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//console

This message is automatically generated.

Removal and restoration of storage directories on checkpointing failure doesn't work properly

Key: HDFS-2011
URL: https://issues.apache.org/jira/browse/HDFS-2011
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Attachments: HDFS-2011.3.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch

Removal and restoration of storage directories on checkpointing failure doesn't work properly.
Sometimes it throws a NullPointerException and sometimes it doesn't take off a failed storage directory.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041707#comment-13041707 ] Matt Foley commented on HDFS-2011: -- Hi Ravi, for future reference please write a short Description field, then add the long details in a first Comment. The problem is the Description gets re-sent in every Jira email about the ticket. Thanks. Removal and restoration of storage directories on checkpointing failure doesn't work properly - Key: HDFS-2011 URL: https://issues.apache.org/jira/browse/HDFS-2011 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: HDFS-2011.patch I had been automating tests to verify the removal and restoration of storage directories. I was testing by setting up a loopback file system, using that as one of the storage directories, and filling it up to make the writes from Hadoop namenode to the checkpoint fail. Mostly I would see the functionality work. However, very often I would see this exception in the logs: 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. 
java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:297) at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131) at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417) at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:324) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522) In this case the storage directory wasn't taken offline. It would not be removed from the list. John George figured out this was because the IOException was happening in a code path from which the function to remove the corresponding storage directory wasn't being called. Also, very rarely, I would see this exception: 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862:
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041858#comment-13041858 ]

Ravi Prakash commented on HDFS-2011:

I had been automating tests to verify the removal and restoration of storage directories. I was testing by setting up a loopback file system, using that as one of the storage directories, and filling it up to make the writes from the Hadoop namenode to the checkpoint fail. Mostly I would see the functionality work. However, very often I would see this exception in the logs:

2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

In this case the storage directory wasn't taken offline. It would not be removed from the list. John George figured out this was because the IOException was happening in a code path from which the function to remove the corresponding storage directory wasn't being called.

Also, very rarely, I would see this exception:

2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
    at java.security.AccessController.doPrivileged(Native Method)
    at
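
The two symptoms reported in this comment (a failed directory that stays in the active list, and a NullPointerException thrown while closing a stream during error handling) can be illustrated with a small standalone sketch. This is not the HDFS-2011 patch and not NameNode code: every class, field, and method name below is hypothetical, and the "disk full" condition is simulated with a flag. It only shows the general pattern of collecting failed directories on every write path and closing their streams defensively.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not HDFS code.
class StorageDirFailureSketch {

    /** Stand-in for an edit log stream whose underlying channel may be null. */
    static final class EditStream implements Closeable {
        private Closeable channel; // null if never opened or already closed

        EditStream(Closeable channel) {
            this.channel = channel;
        }

        @Override
        public void close() throws IOException {
            // Return early instead of dereferencing a null channel, so that
            // error handling cannot itself fail with a NullPointerException.
            if (channel == null) {
                return;
            }
            try {
                channel.close();
            } finally {
                channel = null; // a second close() becomes a no-op
            }
        }
    }

    /** Stand-in for one storage directory and its open stream. */
    static final class StorageDir {
        final String path;
        final EditStream stream;
        final boolean full; // simulates "No space left on device"

        StorageDir(String path, EditStream stream, boolean full) {
            this.path = path;
            this.stream = stream;
            this.full = full;
        }

        void write(byte[] data) throws IOException {
            // The data is ignored; only the failure behaviour is modelled.
            if (full) {
                throw new IOException("No space left on device: " + path);
            }
        }
    }

    /**
     * Writes to every active directory. Any directory whose write fails is
     * closed and removed from the active list; the removed ones are returned.
     */
    static List<StorageDir> writeToAll(List<StorageDir> active, byte[] data) {
        List<StorageDir> failed = new ArrayList<>();
        for (StorageDir dir : active) {
            try {
                dir.write(data);
            } catch (IOException e) {
                failed.add(dir); // remember it; do not mutate while iterating
            }
        }
        for (StorageDir dir : failed) {
            try {
                dir.stream.close();
            } catch (IOException e) {
                // The directory is already being dropped; nothing more to do.
            }
        }
        active.removeAll(failed);
        return failed;
    }

    public static void main(String[] args) {
        List<StorageDir> active = new ArrayList<>();
        active.add(new StorageDir("/data/1/name", new EditStream(() -> { }), false));
        // Second directory is "full" and its stream has no open channel.
        active.add(new StorageDir("/data/2/name", new EditStream(null), true));

        List<StorageDir> failed = writeToAll(active, new byte[] {1});
        System.out.println("active=" + active.size() + " failed=" + failed.get(0).path);
    }
}
```

The point of routing every write through one helper such as the hypothetical `writeToAll` is that no I/O path can fail without the directory being reported, which is the gap John George identified in the image-transfer path.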
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041859#comment-13041859 ]

Ravi Prakash commented on HDFS-2011:

Thanks for your comments Todd and Matt! :) I'm working on a unit test. I'm almost done. Sorry for junking up the emails. I've edited the JIRA and shortened the description. I promise it won't happen again. Thanks for the advice. :)

Removal and restoration of storage directories on checkpointing failure doesn't work properly

Key: HDFS-2011
URL: https://issues.apache.org/jira/browse/HDFS-2011
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
Attachments: HDFS-2011.patch

Removal and restoration of storage directories on checkpointing failure doesn't work properly. Sometimes it throws a NullPointerException and sometimes it doesn't take off a failed storage directory.
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041195#comment-13041195 ] Hadoop QA commented on HDFS-2011: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480858/HDFS-2011.patch against trunk revision 1128987. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/659//console This message is automatically generated. Removal and restoration of storage directories on checkpointing failure doesn't work properly - Key: HDFS-2011 URL: https://issues.apache.org/jira/browse/HDFS-2011 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: HDFS-2011.patch
[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly
[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041200#comment-13041200 ] Todd Lipcon commented on HDFS-2011: --- Any chance of unit tests for these? Removal and restoration of storage directories on checkpointing failure doesn't work properly - Key: HDFS-2011 URL: https://issues.apache.org/jira/browse/HDFS-2011 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.23.0 Reporter: Ravi Prakash Assignee: Ravi Prakash Attachments: HDFS-2011.patch