Re: NameNode - didn't persist the edit log
Hi Guy,

Eli has been looking into these issues and it looks like you found a nasty bug. You can follow these JIRAs to track resolution: HDFS-2701, HDFS-2702, HDFS-2703. I think in particular HDFS-2703 is the one that bit you here.

-Todd
NameNode - didn't persist the edit log
Hi guys,

We recently had the following problem on our production cluster: the filesystem containing the editlog and fsimage had no free inodes. As a result, the namenode wasn't able to obtain an inode for the fsimage and editlog after a checkpoint was reached, even though the previous files were freed. Unfortunately, we had no monitoring on the number of free inodes, so the namenode ran in this state for a few hours. We noticed the failure on its DFS-status page, but the namenode didn't enter safe-mode, so all the writes that were made couldn't be persisted to the editlog.

After discovering the problem we freed inodes and the filesystem seemed to be okay again, but we tried to force the namenode to persist to the editlog with no success. Eventually we restarted the namenode, which of course caused us to lose all the data that was written to HDFS during those few hours (fortunately we have a backup of the recent writes, so we restored the data from there).

This situation raises some severe concerns:

1. How come the namenode identified a failure in persisting its editlog and didn't enter safe-mode? (The exception was given only WARN severity and not CRITICAL.)
2. How come, after we freed inodes, we couldn't get the namenode to persist? Maybe there should be a CLI command that would let us force the namenode to persist its editlog.

Do you know of a JIRA opened for these issues, or should I open one?

Thanks,
Guy
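A minimal sketch of the kind of free-inode monitoring that was missing here, suitable for a cron job on the namenode host. The directory matches the one shown in the log later in this thread; the 90% threshold is an arbitrary illustrative value, not something from the original report:

#!/bin/sh
# Sketch only: warn when the filesystem holding the NameNode metadata
# directory is running low on free inodes.
NAME_DIR=/srv/hadoop/hdfs/name
THRESHOLD=90
# df -Pi prints POSIX-format output with the inode usage percentage in column 5.
USED_PCT=$(df -Pi "$NAME_DIR" | awk 'NR==2 { sub("%", "", $5); print $5 }')
if [ "$USED_PCT" -ge "$THRESHOLD" ]; then
    echo "WARNING: inode usage for $NAME_DIR is ${USED_PCT}% (threshold ${THRESHOLD}%)"
    exit 1
fi

Exiting non-zero lets whatever alerting mechanism runs the check (cron mail, Nagios, etc.) pick up the condition before the namenode hits zero free inodes.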
Re: NameNode - didn't persist the edit log
Hi Todd, you are right, I should be more specific:

1. From the namenode log:

2011-12-11 08:57:23,245 WARN org.apache.hadoop.hdfs.server.common.Storage: rollEdidLog: removing storage /srv/hadoop/hdfs/edit
2011-12-11 08:57:23,311 WARN org.apache.hadoop.hdfs.server.common.Storage: incrementCheckpointTime failed on /srv/hadoop/hdfs/name;type=IMAGE
2011-12-11 08:57:23,316 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed.
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:83)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:78)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:78)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

2. Embarrassingly, no, we had only one; now we have two, plus periodic backups :(

3. Yes.

4. hadoop version
Hadoop 0.20.2-cdh3u2
Subversion file:///tmp/nightly_2011-10-13_20-02-02_3/hadoop-0.20-0.20.2+923.142-1~lucid -r 95a824e4005b2a94fe1c11f1ef9db4c672ba43cb
Compiled by root on Thu Oct 13 21:52:18 PDT 2011
From source with checksum 644e5db6c59d45bca96cec7f220dda51

Thanks,
Guy

On Thu 15 Dec 2011 11:39:26 AM IST, Todd Lipcon wrote:
Hi Guy, Several questions come to mind here:
- What was the exact WARN level message you saw?
- Did you have multiple dfs.name.dirs configured as recommended by most setup guides?
- Did you try entering safemode and then running saveNamespace to persist the image before shutting down the NN? This would have saved your data.
- What exact version of HDFS were you running?
This is certainly not expected behavior... all of the places where an edit log fails have a check against there being 0 edit logs remaining and should issue a FATAL level message followed by a System.exit(-1).
-Todd
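For reference, the recovery path Todd describes maps onto the dfsadmin CLI; roughly, as a sketch (run as the HDFS superuser on this 0.20/CDH3-era release):

# Stop namespace mutations, then force the in-memory namespace to be
# written out as a fresh fsimage in every configured dfs.name.dir.
hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
# Once the image is safely on disk (and ideally copied elsewhere), leave safemode.
hadoop dfsadmin -safemode leave

Todd's other suggestion, multiple dfs.name.dirs, is configured as a comma-separated list of directories for dfs.name.dir in hdfs-site.xml (for example a local disk plus an NFS mount), so that losing one storage directory does not take out the only copy of the image and edits.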