[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-18 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067108#comment-13067108
 ] 

Eli Collins commented on HDFS-2011:
---

Ah, right, abort is new in 1073, I think the tests you have for trunk are fine 
for now. The new test with abort will come in when 1073 is merged.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073-2.txt, 
 elfos-close-patch-on-1073-3.txt, elfos-close-patch-on-1073.txt


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-17 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066603#comment-13066603
 ] 

Ravi Prakash commented on HDFS-2011:


@Eli - The new patch tests abort() but I couldn't find the method in 
EditLogFileOutputStream.java in trunk. Its available in HDFS-1073. Could you 
please point me to what exactly I should incorporate into trunk?

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073-2.txt, 
 elfos-close-patch-on-1073-3.txt, elfos-close-patch-on-1073.txt


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13065464#comment-13065464
 ] 

Todd Lipcon commented on HDFS-2011:
---

Committed elfos-close-patch-on-1073-3.txt to the HDFS-1073 branch to fix the 
test case. Thanks Eli.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073-2.txt, 
 elfos-close-patch-on-1073-3.txt, elfos-close-patch-on-1073.txt


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13065667#comment-13065667
 ] 

Hudson commented on HDFS-2011:
--

Integrated in Hadoop-Hdfs-1073-branch #9 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-1073-branch/9/])
Amend HDFS-2011 for HDFS-1073 branch. Update test cases for new behavior of 
EditLogFileOutputStream. Contributed by Todd Lipcon and Eli Collins.

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1146848
Files : 
* 
/hadoop/common/branches/HDFS-1073/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* 
/hadoop/common/branches/HDFS-1073/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073-2.txt, 
 elfos-close-patch-on-1073-3.txt, elfos-close-patch-on-1073.txt


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-13 Thread John George (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064880#comment-13064880
 ] 

John George commented on HDFS-2011:
---

+1 Looks good to me. Thanks Eli.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073-2.txt, 
 elfos-close-patch-on-1073.txt


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-07 Thread John George (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13061273#comment-13061273
 ] 

John George commented on HDFS-2011:
---

The patch looks good. Shouldn't the sequence close() close() be tested as 
well, since that could be a case that could possibly happen?

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch, elfos-close-patch-on-1073.txt


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-06 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060745#comment-13060745
 ] 

Todd Lipcon commented on HDFS-2011:
---

I'm working on merging this with HDFS-1073, and had one question: when do we 
expect that an editlog stream would be closed twice? In 1073 there are some 
extra asserts, so instead of ignoring the second close, it now throws 
java.io.IOException: Trying to use aborted output stream. I'm debating 
whether to remove this exception like you've done in this patch, vs remove the 
patch, since it seems like it might be indicative of a bug to close a stream 
twice.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-06 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060775#comment-13060775
 ] 

Ravi Prakash commented on HDFS-2011:


I had noticed close being called twice while testing this functionality . This 
was causing a NullPointerException the second time. The stack trace is given in 
comment 
https://issues.apache.org/jira/browse/HDFS-2011?focusedCommentId=13041858page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041858

{quote}
2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
87 on 8020, call getEditLogSize() from
98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
{quote}

The bug itself is quite hard to reproduce. I had to run my tests in an infinite 
loop and the NullPointerException happened after 3-4 hours (each run of the 
test would take 2 mins maybe). After the NullPointerException, the namenode 
would essentially be useless. Even hdfs dfs -ls would throw a 
NullPointerException.

I am not sure myself which philosophy would be better. FileOutputStream itself 
ignores a second close. I checked this with the following program

{noformat}
import java.io.*;

public class TestJAVA 
{

public static void main(String args[]) 
{
System.out.println(Hello World);
try {

FileOutputStream fos = new 
FileOutputStream(/tmp/ravi.txt);
fos.write(50);
fos.write(50);
fos.write(50);
fos.write(50);
fos.write(50);
fos.write(50);
fos.close();
fos.close();
} catch (IOException ioe) {
System.out.println(Hello California);
System.out.println (ioe);
}
System.out.println(Hello Champaign);

}

}
{noformat}

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-06 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060777#comment-13060777
 ] 

Ravi Prakash commented on HDFS-2011:


The program above output
{noformat}
Hello World
Hello Champaign
{noformat}

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-06 Thread John George (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060780#comment-13060780
 ] 

John George commented on HDFS-2011:
---

If I remember right, it was a case of an incomplete create as opposed to 
close being called twice. So, the close() was being called on a stream that was 
not really created...

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-06 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060812#comment-13060812
 ] 

Todd Lipcon commented on HDFS-2011:
---

In the HDFS-1073 branch, EditLogOutputStream now has separate close() and 
abort() methods. abort() is used when there has been some error on the stream 
and we expect to do an unclean close (ie without flushing). close() is used 
for clean closes. If close() itself fails, it will then proceed to abort() when 
the IO error is handled.

So, I think the correct test case on the branch is to call abort() twice and 
make sure that's ignored, or call close() and then abort() to make sure that's 
ignored. Does that sound reasonable?

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-06 Thread John George (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13060853#comment-13060853
 ] 

John George commented on HDFS-2011:
---

I think calling
1. abort() twice
2. close() twice
3. close() followed by an abort()

would test most cases.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058532#comment-13058532
 ] 

Hudson commented on HDFS-2011:
--

Integrated in Hadoop-Hdfs-trunk #712 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/712/])
HDFS-2011. Removal and restoration of storage directories on checkpointing 
failure doesn't work properly. Contributed by Ravi Prakash.

mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1141748
Files : 
* 
/hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
* /hadoop/common/trunk/hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* 
/hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-07-01 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058605#comment-13058605
 ] 

Ravi Prakash commented on HDFS-2011:


Thanks Matt, Todd and Cos! My first patch into Hadoop. Yaay!!!

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13058136#comment-13058136
 ] 

Hudson commented on HDFS-2011:
--

Integrated in Hadoop-Hdfs-trunk-Commit #771 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/771/])
HDFS-2011. Removal and restoration of storage directories on checkpointing 
failure doesn't work properly. Contributed by Ravi Prakash.

mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1141748
Files : 
* 
/hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
* /hadoop/common/trunk/hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* 
/hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Fix For: 0.23.0

 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056538#comment-13056538
 ] 

Hadoop QA commented on HDFS-2011:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12484433/HDFS-2011.8.patch
  against trunk revision 1140030.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/859//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/859//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/859//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.8.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055660#comment-13055660
 ] 

Ravi Prakash commented on HDFS-2011:


Thanks Matt,

Incorporated all your comments :)

{quote}
9. In assertTrue(List of storage directories didn't have 
storageDirToCheck), did you intend to iterate over all elements in the 
list? You go to the trouble of creating an iterator, and then only use the 
first element.
{quote}
I meant to get the first element of the Collection (since that's what 
nnStorage.getEditsDirectories() returns me). 


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055663#comment-13055663
 ] 

Ravi Prakash commented on HDFS-2011:


{quote}
Oh, and in that last bit of code, if fc is still open I would think it should 
be closed after the truncate. But it isn't in the current code. Can you see a 
reason why?
{quote}

I don't know how File Channels work, but in the constructor you can see that fc 
and fp are both derived from the same RandomAccessFile (rp). Could calling 
fp.close() automatically close fc too?


{quote}
Ravi, I don't think this collides with HDFS-988, but please check.
{quote}
diffstat's didn't have any common files. 
{noformat}
$ diffstat hdfs-988-7.patch 
 java/org/apache/hadoop/hdfs/DFSOutputStream.java   
 |5 
 java/org/apache/hadoop/hdfs/server/namenode/BlockManager.java  
 |5 
 java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java   
 |  804 ++--
 java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java   
 |   10 
 java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java  
 | 1891 +-
 java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java  
 |1 
 test/hdfs/org/apache/hadoop/cli/TestHDFSCLI.java   
 |2 
 test/hdfs/org/apache/hadoop/hdfs/DFSTestUtil.java  
 |9 
 test/hdfs/org/apache/hadoop/hdfs/TestDecommission.java 
 |6 
 test/hdfs/org/apache/hadoop/hdfs/TestSafeMode.java 
 |  208 -
 test/hdfs/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java  
 |   15 
 test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestDeadDatanode.java 
 |9 
 test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestHeartbeatHandling.java
 |8 
 
test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestNNThroughputBenchmark.java 
|1 
 test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestSafeMode.java 
 |   90 
 test/unit/org/apache/hadoop/hdfs/server/namenode/TestNNLeaseRecovery.java  
 |   39 
 16 files changed, 1676 insertions(+), 1427 deletions(-)
$ diffstat HDFS-2011.6.patch 
 java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java |   
31 +++-
 java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java   |
6 
 test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java |   
67 ++
 3 files changed, 94 insertions(+), 10 deletions(-)
{noformat}
test-patch passed and I also ran my automated test twice just to be sure. 
Functionality doesn't seem to have collided.



 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055689#comment-13055689
 ] 

Hadoop QA commented on HDFS-2011:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12483975/HDFS-2011.6.patch
  against trunk revision 1140030.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these core unit tests:
  org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
  org.apache.hadoop.hdfs.TestFileAppend2

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/848//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/848//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/848//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-27 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055745#comment-13055745
 ] 

Ravi Prakash commented on HDFS-2011:


The test failed because fc.close() wasn't being called. Thanks Matt! :) 
Including that in the latest patch. The two failed tests passed and test-patch 
too. Also my automated unit tests.


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.patch, HDFS-2011.patch, 
 HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055799#comment-13055799
 ] 

Hadoop QA commented on HDFS-2011:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12484007/HDFS-2011.7.patch
  against trunk revision 1140030.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/849//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/849//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/849//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.patch, HDFS-2011.patch, 
 HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-27 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056216#comment-13056216
 ] 

Matt Foley commented on HDFS-2011:
--

Looking great!  And you're right about needing the iterator to access 
nnStorage.getEditsDirectories(), since it's just a Collection.

There's just one small thing I'd like to fix:
* in testSetCheckpointTimeInStorageHandlesIOException, new File ctor, you 
shouldn't need slashes around /storageDirToCheck/, storageDirToCheck should 
work.
* and one trivial edit: The comment in EditLogFileOutputStream.close(), // if 
already closed, just return isn't correct any more.  // if already closed, 
just skip would be correct.

If that's okay, I'll commit it on the next round.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.6.patch, HDFS-2011.7.patch, HDFS-2011.patch, HDFS-2011.patch, 
 HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-14 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13049387#comment-13049387
 ] 

Matt Foley commented on HDFS-2011:
--

Ravi, I don't think this collides with HDFS-988, but please check.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-06 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045163#comment-13045163
 ] 

Matt Foley commented on HDFS-2011:
--

Hi Ravi, looking a lot better.  Here's a few more.

TestCheckpoint.testEditLogFileOutputStreamCloses():
1. Use the two-argument form of File ctor:
File(System.getProperty(test.build.data,/tmp), editLogStream.dat)
not
File(System.getProperty(test.build.data,/tmp) + editLogStream.dat)
This will insure the path delimiter is inserted correctly.
2. in the finally clause:
Again, there's no point in catch-then-assert, unless you need to do something 
in between.  You can just let it fail.  The point of the try/catch I 
recommended was so that it WOULDN'T fail, because failing could prevent any 
prior exception info from propagating.  So the catch clause should use 
println to log the problem, but not fail or otherwise cause an assert.
3. if you want to fine-tune that a little, you could have a variable success 
which is set to false at the beginning, and set to true at the end of the main 
body (before the finally clause).  Then in this catch clause you could 
throw if success==true, but just println if !success.
4. If you need to use println or Assert to message an exception, you can use 
StringUtils.stringifyException(e), which prints the whole stack trace, vs 
e.toString(), which only prints one line of info. But LOG and throw 
messages allow using Exception objects as additional arguments, giving the same 
result as StringUtils.stringifyException() but with cleaner syntax.

testSetCheckpointTimeInStorageHandlesIOException():
5. Use File(System.getProperty(test.build.data,/tmp), storageDirToCheck)
instead of File (System.getProperty(test.build.data,/tmp) + 
/storageDirToCheck/)
6. and don't put a space between File and the following parenthesis.  A ctor 
is a method call.
7. You probably want to use mkdirs() rather than mkdir().
8. Extra blanks before and after argument lists are against the coding style 
standard.
9. In assertTrue(List of storage directories didn't have 
storageDirToCheck), did you intend to iterate over all elements in the 
list?  You go to the trouble of creating an iterator, and then only use the 
first element.
10. Doesn't this routine also need a try/catch context that cleans up the 
created directory if an error occurs?

EditLogFileOutputStream.close():
11. Interesting problem.  It looks like fc and fp are not directly dependent on 
bufCurrent and bufReady, but simply are likely to be null if bufCurrent and 
bufReady end up null, therefore I think we should still treat bufCurrent and 
bufReady as possibly valid/invalid separately.  Can you tell if the problem 
value of fc is null or simply a file descriptor that is already closed?  What 
if you use the previously suggested statements for bufCurrent and bufReady, 
followed by:
{code}
// remove the last INVALID marker from transaction log.
if (fc != null  fc.isOpen()) {
  fc.truncate(fc.position());
}
if (fp != null) {
  fp.close();
}
{code}


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-06 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045168#comment-13045168
 ] 

Matt Foley commented on HDFS-2011:
--

Oh, and in that last bit of code, if fc is still open I would think it should 
be closed after the truncate.  But it isn't in the current code.  Can you see a 
reason why?

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-03 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043370#comment-13043370
 ] 

Ravi Prakash commented on HDFS-2011:


Hi Matt,

Thanks a ton for your review! I learned a lot from your detailed explanations. 
:) I followed all of your suggestions. Couple of things to note

1. To be able to throw exceptions like you suggested, I had to make my two 
functions individual jUnit tests. I hope that is fine. (Earlier they were being 
called from testCheckpoint() throws IOException)
2. Thanks for the tip to use toURI. :) However, when I used new 
Path(System.getProperty(test.build.data,/tmp), 
storageDirToCheck).toUri(), the test failed saying 
{noformat} 
Testcase: testSetCheckpointTimeInStorageHandlesIOException took 0.077 sec
Caused an ERROR
Undefined scheme for 
/home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck
java.io.IOException: Undefined scheme for 
/home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck
at 
org.apache.hadoop.hdfs.server.namenode.NNStorage.checkSchemeConsistency(NNStorage.java:348)
at 
org.apache.hadoop.hdfs.server.namenode.NNStorage.setStorageDirectories(NNStorage.java:306)
at 
org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testSetCheckpointTimeInStorageHandlesIOException(TestCheckpoint.java:179)
{noformat} 
So I changed it to use new File(...).toURI(). I hope that is fine too. 
3. In the comment, I meant to convey that the block of code was for when 
writeCheckpointTime incurred an IOException. I've removed the comment seeing 
that it had been already mentioned by the comment above it. Sorry for the 
ambiguity.
4. When I separated the bufCurrent and bufReady cases, the test failed saying 
{noformat} 
Testcase: testEditLogFileOutputStreamCloses took 0.042 sec
Caused an ERROR
Bad file descriptor
java.io.IOException: Bad file descriptor
at sun.nio.ch.FileChannelImpl.position0(Native Method)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:284)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:141)
at 
org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testEditLogFileOutputStreamCloses(TestCheckpoint.java:154)
{noformat} 
This was because these lines (more specifically the 1st) were still being 
called.  
{noformat} 
// remove the last INVALID marker from transaction log.
fc.truncate(fc.position());
fp.close();
{noformat} 
I've let it remain the same. 

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043414#comment-13043414
 ] 

Hadoop QA commented on HDFS-2011:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12481349/HDFS-2011.4.patch
  against trunk revision 1130870.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//testReport/
Findbugs warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-03 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043892#comment-13043892
 ] 

Todd Lipcon commented on HDFS-2011:
---

A few style nits on HDFS-2011.4.patch:
- please try to keep lines under 80 columns wide where possible. If it spills 
over to 85 or 90 here and there, not a huge deal, but 100+ columns should be 
avoided
- why catch SecurityException? that looks very much out of place, and given 
it's an unchecked exception, you don't need to catch it at all
- the assertTrue around mkdir() in 
testSetCheckpoingTimeInStorageHandlesIOException should probably check 
exists() || mkdir(). Or call deleteFully on it at the top of the test
- you construct that same file path several times in the same test. Please just 
make it once as a constant
- in the error messages, better to do something like: Couldn't remove 
directory  + TEST_STORAGE_DIR.getAbsoluteFile(). That way the developer can 
easily track down the full path
- alignment is off in NNStorage.java change
- the comment referring to edit and edits.new in ELFOS is out of place - that 
class shouldn't know about details of how it's used. Instead it should read 
something like // if already closed, just return
- TestCheckpoint inherits from TestCase, so you don't need to import 
org.junit.Assert

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.patch, 
 HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-03 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043990#comment-13043990
 ] 

Ravi Prakash commented on HDFS-2011:


Hi Todd,

Thanks a lot for reviewing the patch. :) I continue to learn :) I have followed 
all of your suggestions. The only note is that I am checking for 
SecurityException so that in the finally block it doesn't mask an IOException / 
NullPointerException that was possibly thrown in the try block. I hope that is 
fine.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-03 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044011#comment-13044011
 ] 

Todd Lipcon commented on HDFS-2011:
---

re SecurityException: I still don't see any reason that delete() would throw 
such an exception. AFAIK that only happens if a security manager is installed, 
which we never expect in unit tests.

Will try to look over the new patch revision later today or early next week. 
Thanks!

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044025#comment-13044025
 ] 

Hadoop QA commented on HDFS-2011:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12481396/HDFS-2011.5.patch
  against trunk revision 1131124.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//testReport/
Findbugs warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.4.patch, HDFS-2011.5.patch, 
 HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-02 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043144#comment-13043144
 ] 

Matt Foley commented on HDFS-2011:
--

Hi Ravi, the logic of your changes is fine.  The following comments are almost 
all regarding common usages in Hadoop code base and unit tests.

TestCheckpoint.checkEditLogFileOutputStreamCloses():
* To obtain the build/test/data directory correctly, use 
System.getProperty(test.build.data,/tmp) rather than hardcoding it; then 
create your desired file relative to that directory.
* I find elfosFile to be a very opaque name.  Would it be reasonable to use 
something like editLogStream instead?
* Instead of Assert.assertTrue(msg,false), use Assert.fail(msg).
* But, there is no need to catch and message exceptions that shouldn't happen.  
Both catch{} clauses add no significant value compared to the stack trace 
that will be printed on exception, by junit.  In fact, the stack trace from the 
catch-and-Assert is LESS informative than the original exception stack trace 
would have been, because it points into the catch clause instead of into where 
the exception actually occurred.
* It's good that you bracket both the beginning and end with printlns that 
clearly state what is being tested; if an exception occurs the developer will 
immediately see what went wrong (with the help of the stack trace).
* Within the finally{} clause, it might be a good idea to put the delete() 
call in its own try/catch context.  If another exception happened, you wouldn't 
want to interfere with the original exception message, which carries the info 
you created the testcase to expose.

checkSetCheckpointTimeInStorageHandlesIOException():
* alFS and alES are also very opaque names.  Hadoop doesn't subscribe to 
Hungarian naming, so the al prefix isn't needed.  FS usually means 
FileSystem, which isn't the same as a StorageDirectory.  So consider renaming 
these, perhaps to fsImageDirs and editsDirs.
* First try/catch context:  Again, there's no need to catch-and-Assert 
unexpected failures.  If they occur, they will be duly reported by junit.
* As before, the place to put the directories you create should be relative to 
System.getProperty(test.build.data,/tmp).
* And to create the URIs, it is probably best to do something equivalent to 
new Path(System.getProperty(test.build.data,/tmp), 
storageDirToCheck).toUri().  This will work around any filesystem path 
naming oddities.
* In the assert, use listRsd.get(listRsd.size()-1) instead of listRsd.get(0), 
because the new element would be added to the end of the list -- I think :-)
* It might be good to use nnStorage.getEditsDirectories() and/or 
nnStorage.getImageDirectories() before deleting the dir, to assure that the 
setStorageDirectories() had the expected result, and call 
nnStorage.getRemovedStorageDirs() before to assure that the list initially does 
not contain storageDirToCheck.

NNStorage.setCheckpointTimeInStorage():
* In the comment //Since writeCheckpointTime may also encounter an IOException 
in case underlying storage fails substitute reportErrorsOnDirectories() for 
writeCheckpointTime.
* There is a singular reportErrorsOnDirectory() method.  Could you use it 
instead of reportErrorsOnDirectories()?  Then you wouldn't need to construct 
the ArrayList.
* In the LOG.error if the second IOE happens, suggest 
LOG.error(Failed to report and remove NN storage directory  + 
sd.getRoot().getPath(), ioe);
Besides clarifying the msg, note that + ioe uses ioe.toString(), which only 
prints a single line about the exception, while using , ioe in a LOG argument 
list causes the entire stack trace to be printed.

EditLogFileOutputStream.close():
* Suggest you separate the bufCurrent and bufReady cases.  Do:
{code}
if (bufCurrent != null) {
int bufSize = bufCurrent.size();
if (bufSize != 0) {
  throw new IOException(FSEditStream has  + bufSize
  +  bytes still to be flushed and cannot  + be closed.);
}
bufCurrent.close();
}
if (bufReady != null) {
bufReady.close();
}
{code}


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.patch, HDFS-2011.patch, 
 HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more 

[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-02 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043220#comment-13043220
 ] 

Konstantin Boudnik commented on HDFS-2011:
--

There's also this error message
{{+LOG.error(Problem erroring streams  + ioe);}}
which is somewhat moot.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.patch, HDFS-2011.patch, 
 HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-01 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042285#comment-13042285
 ] 

Ravi Prakash commented on HDFS-2011:


I ran test-patch. Also ran ant-test and no new test failures have been 
introduced. Can someone please review / commit the patch?

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042356#comment-13042356
 ] 

Hadoop QA commented on HDFS-2011:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12481110/HDFS-2011.patch
  against trunk revision 1129942.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these core unit tests:
  org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
  org.apache.hadoop.hdfs.TestHFlush

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//testReport/
Findbugs warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042365#comment-13042365
 ] 

Hadoop QA commented on HDFS-2011:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12481119/HDFS-2011.patch
  against trunk revision 1129942.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these core unit tests:
  org.apache.hadoop.cli.TestHDFSCLI

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//testReport/
Findbugs warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch, HDFS-2011.patch, HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042402#comment-13042402
 ] 

Hadoop QA commented on HDFS-2011:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12481129/HDFS-2011.3.patch
  against trunk revision 1130262.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//testReport/
Findbugs warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.3.patch, HDFS-2011.patch, HDFS-2011.patch, 
 HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-05-31 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041707#comment-13041707
 ] 

Matt Foley commented on HDFS-2011:
--

Hi Ravi, for future reference please write a short Description field, then add 
the long details in a first Comment.  The problem is the Description gets 
re-sent in every Jira email about the ticket.  Thanks.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch


 I had been automating tests to verify the removal and restoration of storage 
 directories. I was testing by setting up a loopback file system, using that 
 as one of the storage directories, and filling it up to make the writes from 
 Hadoop namenode to the checkpoint fail. 
 Mostly I would see the functionality work. However, very often I would see 
 this exception in the logs: 
 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: 
 GetImage failed. java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:297)
 at 
 org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at 
 org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
 at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
 at 
 org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
 at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
 at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
 at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at 
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
 at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:324)
 at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
 at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
 at 
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
 at 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
 In this case the storage directory wasn't taken offline. It would not be 
 removed from the list. John George figured out this was because the 
 IOException was happening in a code path fromm where the function to remove 
 the corresponding wasn't being called. 
 Also, very rarely, I would see this exception
 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 87 on 8020, call getEditLogSize() from
 98.137.97.99:35862: 

[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-05-31 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041858#comment-13041858
 ] 

Ravi Prakash commented on HDFS-2011:


I had been automating tests to verify the removal and restoration of storage 
directories. I was testing by setting up a loopback file system, using that as 
one of the storage directories, and filling it up to make the writes from 
Hadoop namenode to the checkpoint fail.
Mostly I would see the functionality work. However, very often I would see this 
exception in the logs:

2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: 
GetImage failed. java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:297)
at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

In this case the storage directory wasn't taken offline. It would not be 
removed from the list. John George figured out this was because the IOException 
was happening in a code path fromm where the function to remove the 
corresponding wasn't being called.

Also, very rarely, I would see this exception

2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
87 on 8020, call getEditLogSize() from
98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
at java.security.AccessController.doPrivileged(Native Method)
at 

[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-05-31 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041859#comment-13041859
 ] 

Ravi Prakash commented on HDFS-2011:


Thanks for your comments Todd and Matt! :)

I'm working on a unit test. I'm almost done. 

Sorry for the junking the emails. I've edited the JIRA and shortened the 
description. I promise it won't happen again. Thanks for the advice. :)

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch


 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly. Sometimes it throws a NullPointerException and 
 sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-05-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041195#comment-13041195
 ] 

Hadoop QA commented on HDFS-2011:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12480858/HDFS-2011.patch
  against trunk revision 1128987.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/659//console

This message is automatically generated.

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch


 I had been automating tests to verify the removal and restoration of storage 
 directories. I was testing by setting up a loopback file system, using that 
 as one of the storage directories, and filling it up to make the writes from 
 Hadoop namenode to the checkpoint fail. 
 Mostly I would see the functionality work. However, very often I would see 
 this exception in the logs: 
 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: 
 GetImage failed. java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:297)
 at 
 org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at 
 org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
 at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
 at 
 org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
 at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
 at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
 at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at 
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
 at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:324)
 at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
 at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
 at 
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
 at 
 

[jira] [Commented] (HDFS-2011) Removal and restoration of storage directories on checkpointing failure doesn't work properly

2011-05-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041200#comment-13041200
 ] 

Todd Lipcon commented on HDFS-2011:
---

Any chance of unit tests for these?

 Removal and restoration of storage directories on checkpointing failure 
 doesn't work properly
 -

 Key: HDFS-2011
 URL: https://issues.apache.org/jira/browse/HDFS-2011
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Assignee: Ravi Prakash
 Attachments: HDFS-2011.patch


 I had been automating tests to verify the removal and restoration of storage 
 directories. I was testing by setting up a loopback file system, using that 
 as one of the storage directories, and filling it up to make the writes from 
 Hadoop namenode to the checkpoint fail. 
 Mostly I would see the functionality work. However, very often I would see 
 this exception in the logs: 
 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: 
 GetImage failed. java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:297)
 at 
 org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
 at 
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at 
 org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
 at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
 at 
 org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
 at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
 at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
 at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at 
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
 at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:324)
 at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
 at 
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
 at 
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
 at 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
 In this case the storage directory wasn't taken offline. It would not be 
 removed from the list. John George figured out this was because the 
 IOException was happening in a code path fromm where the function to remove 
 the corresponding wasn't being called. 
 Also, very rarely, I would see this exception
 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
 87 on 8020, call getEditLogSize() from
 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
 java.io.IOException: java.lang.NullPointerException
 at