[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041200#comment-13041200 ]
Todd Lipcon commented on HDFS-2011:
-----------------------------------

Any chance of unit tests for these?

> Removal and restoration of storage directories on checkpointing failure doesn't work properly
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-2011.patch
>
>
> I had been automating tests to verify the removal and restoration of storage
> directories. I was testing by setting up a loopback file system, using it as
> one of the storage directories, and filling it up so that the namenode's
> checkpoint writes would fail.
> Mostly the functionality worked. However, very often I would see this
> exception in the logs:
>
> 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed.
> java.io.IOException: No space left on device
> 	at java.io.FileOutputStream.writeBytes(Native Method)
> 	at java.io.FileOutputStream.write(FileOutputStream.java:297)
> 	at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:416)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
> 	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> 	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
> 	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
> 	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
> 	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
> 	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> 	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> 	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> 	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
> 	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> 	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> 	at org.mortbay.jetty.Server.handle(Server.java:324)
> 	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
> 	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
> 	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
> 	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
> 	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
> 	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
> 	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
>
> In this case the storage directory wasn't taken offline: it would not be
> removed from the list. John George figured out this was because the
> IOException was happening in a code path from which the function to remove
> the corresponding storage directory was never called.
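In other words, the first failure mode is an IOException that propagates out of the transfer path without the failing directory ever being reported and removed. A minimal sketch of the intended invariant, with purely illustrative names (this is not the actual FSImage/FSEditLog API), which also doubles as the kind of unit test Todd is asking about:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative model: every code path that can hit an IOException while writing
// to a storage directory must remove that directory from the active list while
// propagating the exception. The reported bug was a path where the removal
// step was never reached.
class StorageDirectoryList {

    /** Pluggable writer so a test can inject an ENOSPC-style failure. */
    interface CheckpointWriter {
        void write(String dir) throws IOException;
    }

    private final List<String> activeDirs = new ArrayList<>();

    void add(String dir) {
        activeDirs.add(dir);
    }

    List<String> active() {
        return activeDirs;
    }

    void writeCheckpoint(String dir, CheckpointWriter writer) throws IOException {
        try {
            writer.write(dir);
        } catch (IOException e) {
            // Take the failed directory offline before rethrowing; skipping
            // this step is exactly the bug described above.
            activeDirs.remove(dir);
            throw e;
        }
    }
}
```

A test along these lines would inject an IOException ("No space left on device") from the writer and then assert that the failing directory is gone from the active list, without needing a real loopback file system.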
>
> Also, very rarely, I would see this exception:
>
> 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
> 	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
>
> After this, the Secondary Namenode and the Namenode would go into an infinite
> loop of these NullPointerExceptions. John George figured out this was because
> close() was being called on the editStream twice (so it was trying to close an
> edit stream which was already closed).
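The second failure mode, the double close of the edit stream, is the standard argument for making close() idempotent. A minimal sketch of such a guard, with illustrative names (the real EditLogFileOutputStream has its own buffers and its own fix in the attached patch):

```java
import java.io.IOException;

// Illustrative guard: remember that the stream was already closed so a second
// close() is a harmless no-op instead of dereferencing state that the first
// close() nulled out.
class GuardedEditStream {

    private StringBuilder buffer = new StringBuilder(); // stands in for the real edit buffers
    private boolean closed = false;

    void write(String record) throws IOException {
        if (closed) {
            throw new IOException("stream is closed");
        }
        buffer.append(record);
    }

    void close() {
        if (closed) {
            return; // second close: without this check the line below has already run
        }
        closed = true;
        buffer = null; // release resources, as a real close() would
    }
}
```

Without the `closed` flag, a second close() would touch the nulled buffer and throw a NullPointerException on every subsequent call, which matches the repeating NPE loop in the trace above, where FSEditLog.processIOError is itself calling close() on an already-closed stream.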