[ https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041707#comment-13041707 ]

Matt Foley commented on HDFS-2011:
----------------------------------

Hi Ravi, for future reference please write a short Description field, then add 
the long details in a first Comment.  The problem is the Description gets 
re-sent in every Jira email about the ticket.  Thanks.

> Removal and restoration of storage directories on checkpointing failure doesn't work properly
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-2011.patch
>
>
> I had been automating tests to verify the removal and restoration of storage 
> directories. I tested by setting up a loopback file system, using it as one 
> of the storage directories, and filling it up so that the namenode's writes 
> during checkpointing would fail. 
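> A minimal sketch of the filling step, assuming the loopback filesystem was 
> already created and mounted outside Java (e.g. via dd, mkfs, and mount -o 
> loop); the /mnt/loopfs mount point and class name are placeholders, not 
> taken from the actual test: 
>
>     import java.io.File;
>     import java.io.FileOutputStream;
>     import java.io.IOException;
>
>     // Fills the loopback-mounted storage directory until the device is
>     // full, so a subsequent checkpoint write fails with ENOSPC.
>     public class FillStorageDir {
>         public static void main(String[] args) {
>             File filler = new File("/mnt/loopfs/filler"); // placeholder mount point
>             byte[] chunk = new byte[64 * 1024];
>             try (FileOutputStream out = new FileOutputStream(filler)) {
>                 while (true) {
>                     out.write(chunk); // eventually throws once the device fills
>                 }
>             } catch (IOException e) {
>                 // Expected: java.io.IOException: No space left on device
>                 System.out.println("Device full: " + e.getMessage());
>             }
>         }
>     }
>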
> Most of the time the functionality worked. Quite often, however, I would see 
> this exception in the logs: 
> 2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:297)
>         at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
>         at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
>         at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:324)
>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
>         at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> In this case the storage directory wasn't taken offline; it was not removed 
> from the list. John George figured out that this was because the IOException 
> was being thrown in a code path from which the function to remove the 
> corresponding storage directory was never called. 
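> The general shape of the fix is to make every failing write path report the 
> bad directory. A hypothetical sketch of that pattern (class and method names 
> are illustrative, not the actual HDFS code): 
>
>     import java.io.File;
>     import java.io.IOException;
>     import java.util.List;
>     import java.util.concurrent.CopyOnWriteArrayList;
>
>     // Illustrative only: any write path that can fail must also take the
>     // failed storage directory offline; the bug was a path that skipped this.
>     public class CheckpointWriter {
>         private final List<File> activeDirs = new CopyOnWriteArrayList<File>();
>
>         public void writeCheckpoint(File dir) throws IOException {
>             try {
>                 writeImage(dir); // may throw "No space left on device"
>             } catch (IOException ioe) {
>                 activeDirs.remove(dir); // the step the failing path was missing
>                 throw ioe;              // still propagate the original failure
>             }
>         }
>
>         private void writeImage(File dir) throws IOException {
>             // stand-in for the real image transfer
>             throw new IOException("No space left on device");
>         }
>     }
>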
> Also, very rarely, I would see this exception: 
> 2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from 98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)
> After this, the Secondary NameNode and the NameNode would go into an infinite 
> loop of these NullPointerExceptions. John George figured out that this was 
> because close() was being called twice on the edit stream (it was trying to 
> close an edit stream which had already been closed). 
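> The loop can be avoided by making close() idempotent. A hypothetical sketch 
> of such a guard (illustrative only, not the actual patch): 
>
>     import java.io.FileOutputStream;
>     import java.io.IOException;
>
>     // Illustrative guard: closing an already-closed stream becomes a no-op
>     // instead of dereferencing internals that were nulled on the first close.
>     public class GuardedEditLogStream {
>         private FileOutputStream out; // set to null once closed
>
>         public GuardedEditLogStream(FileOutputStream out) {
>             this.out = out;
>         }
>
>         public synchronized void close() throws IOException {
>             if (out == null) {
>                 return; // second close(): nothing to do, no NullPointerException
>             }
>             try {
>                 out.close();
>             } finally {
>                 out = null;
>             }
>         }
>     }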

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
