[ 
https://issues.apache.org/jira/browse/HDFS-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-2011:
-------------------------------

    Description: Removal and restoration of storage directories on 
checkpointing failure doesn't work properly. Sometimes it throws a 
NullPointerException and sometimes it doesn't take off a failed storage 
directory  (was: I had been automating tests to verify the removal and 
restoration of storage directories. I was testing by setting up a loopback file 
system, using that as one of the storage directories, and filling it up to make 
the writes from Hadoop namenode to the checkpoint fail. 
Mostly I would see the functionality work. However, very often I would see this 
exception in the logs: 

2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: 
GetImage failed. java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:297)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
        at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
        at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
        at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
        at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

In this case the storage directory wasn't taken offline. It would not be 
removed from the list. John George figured out this was because the IOException 
was happening in a code path fromm where the function to remove the 
corresponding wasn't being called. 

Also, very rarely, I would see this exception

2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
87 on 8020, call getEditLogSize() from
98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)

After this, the Secondary Namenode and the Namenode would go into infinite 
loops of this NullPointerExceptions. John George figured out this was because 
close was being called on the editStream twice (so it was trying to close an 
editstream which was already closed). 

)

> Removal and restoration of storage directories on checkpointing failure 
> doesn't work properly
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2011
>                 URL: https://issues.apache.org/jira/browse/HDFS-2011
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-2011.patch
>
>
> Removal and restoration of storage directories on checkpointing failure 
> doesn't work properly. Sometimes it throws a NullPointerException and 
> sometimes it doesn't take off a failed storage directory

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to