[
https://issues.apache.org/jira/browse/HDFS-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851164#action_12851164
]
stack commented on HDFS-1024:
-----------------------------
I will commit this issue in next day or so unless objection or unless someone
else beats me to it. IMO, this issue is a blocker as it can corrupt fsimage.
We just ran into this issue except the faliure was in the transfer from 2NN to
NN. Only a partial copy was done because we ran into a CancelledKeyException.
Thereafter our fsimage is corrupt. Subsequent 2NN copies of image all fail
because they fail to read the mangled image.
Here's the evidence.
This is the 2NN log from just before the CancelledKeyException.
{code}
2010-03-28 00:43:28,173 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
fsimage size 9220979 bytes.
2010-03-28 00:43:28,202 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits
size 3190207 bytes.
2010-03-28 00:43:28,288 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-03-28 00:43:28,288 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-03-28 00:43:28,288 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2010-03-28 00:43:28,291 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files = 64126
2010-03-28 00:43:29,016 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files under construction = 15
2010-03-28 00:43:29,510 INFO org.apache.hadoop.hdfs.server.common.Storage:
Edits file /export/hadoop/dfs/namesecondary/current/edits of size 3190207 edits
# 30862 loaded in 0 seconds.
2010-03-28 00:43:29,754 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 9244565 saved in 0 seconds.
2010-03-28 00:43:29,922 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0
Total time for transactions(ms): 0Number of transactions batched in Syncs: 0
Number of syncs: 0 SyncTimes(ms): 0
2010-03-28 00:43:30,058 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
10.10.20.221:50070putimage=1&port=50090&machine=10.10.20.222&token=-18:996348214:1231385509605:1269762207000:1269758607364
2010-03-28 00:43:37,514 WARN org.mortbay.log: Committed before 410 GetImage
failed. java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at
org.mortbay.io.nio.SelectChannelEndPoint.updateKey(SelectChannelEndPoint.java:324)
at
org.mortbay.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:278)
at
org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:542)
at
org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at
org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
at
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
at
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
at
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileServer(TransferFsImage.java:127)
at
org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:49)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
2010-03-28 00:43:37,516 ERROR org.mortbay.log: /getimage
java.lang.IllegalStateException: Committed
at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
at org.mortbay.jetty.Response.sendError(Response.java:240)
at
org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:64)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
2010-03-28 00:43:37,724 WARN
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New
Image Size: 9244565
2010-03-28 01:43:37,747 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0
Total time for transactions(ms): 0Number of transactions batched in Syncs: 0
Number of syncs: 0 SyncTimes(ms): 0
2010-03-28 01:43:38,159 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
fsimage size 4521984 bytes.
2010-03-28 01:43:38,184 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits
size 2745394 bytes.
2010-03-28 01:43:38,262 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-03-28 01:43:38,262 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-03-28 01:43:38,262 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2010-03-28 01:43:38,264 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files = 64295
2010-03-28 01:43:38,613 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
doCheckpoint:
2010-03-28 01:43:38,613 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.hdfs.protocol.Block.readFields(Block.java:136)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:904)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:589)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
at java.lang.Thread.run(Thread.java:619)
2010-03-28 01:48:38,622 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0
Total time for transactions(ms): 0Number of transactions batched in Syncs: 0
Number of syncs: 0 SyncTimes(ms): 0
2010-03-28 01:48:38,651 INFO org.apache.hadoop.hdfs.server.common.Storage:
Recovering storage directory /export/hadoop/dfs/namesecondary from failed
checkpoint.
2010-03-28 01:48:38,696 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
fsimage size 4521984 bytes.
2010-03-28 01:48:38,721 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits
size 2745394 bytes.
2010-03-28 01:48:38,804 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-03-28 01:48:38,805 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-03-28 01:48:38,805 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2010-03-28 01:48:38,806 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files = 64295
2010-03-28 01:48:39,173 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
doCheckpoint:
2010-03-28 01:48:39,174 ERROR
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.hdfs.protocol.Block.readFields(Block.java:136)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:904)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:589)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
at java.lang.Thread.run(Thread.java:619)
....
{code}
And so on.
I don't see log of NN-side of things. We were not running with DEBUG enabled.
> SecondaryNamenode fails to checkpoint because namenode fails with
> CancelledKeyException
> ---------------------------------------------------------------------------------------
>
> Key: HDFS-1024
> URL: https://issues.apache.org/jira/browse/HDFS-1024
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: dhruba borthakur
> Assignee: Dmytro Molkov
> Fix For: 0.22.0
>
> Attachments: HDFS-1024.patch, HDFS-1024.patch.1
>
>
> The secondary namenode fails to retrieve the entire fsimage from the
> Namenode. It fetches a part of the fsimage but believes that it has fetched
> the entire fsimage file and proceeds ahead with the checkpointing. Stack
> traces will be attached below.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.