[ 
https://issues.apache.org/jira/browse/HDFS-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851164#action_12851164
 ] 

stack commented on HDFS-1024:
-----------------------------

I will commit this issue in next day or so unless objection or unless someone 
else beats me to it.  IMO, this issue is a blocker as it can corrupt fsimage.

We just ran into this issue except the faliure was in the transfer from 2NN to 
NN.  Only a partial copy was done because we ran into a CancelledKeyException.  
Thereafter our fsimage is corrupt.   Subsequent 2NN copies of image all fail 
because they fail to read the mangled image.

Here's the evidence.

This is the 2NN log from just before the CancelledKeyException.

{code}
2010-03-28 00:43:28,173 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file 
fsimage size 9220979 bytes.
2010-03-28 00:43:28,202 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits 
size 3190207 bytes.
2010-03-28 00:43:28,288 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-03-28 00:43:28,288 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-03-28 00:43:28,288 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2010-03-28 00:43:28,291 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Number of files = 64126
2010-03-28 00:43:29,016 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Number of files under construction = 15
2010-03-28 00:43:29,510 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Edits file /export/hadoop/dfs/namesecondary/current/edits of size 3190207 edits 
# 30862 loaded in 0 seconds.
2010-03-28 00:43:29,754 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Image file of size 9244565 saved in 0 seconds.
2010-03-28 00:43:29,922 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0 
Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 
Number of syncs: 0 SyncTimes(ms): 0
2010-03-28 00:43:30,058 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL 
10.10.20.221:50070putimage=1&port=50090&machine=10.10.20.222&token=-18:996348214:1231385509605:1269762207000:1269758607364
2010-03-28 00:43:37,514 WARN org.mortbay.log: Committed before 410 GetImage 
failed. java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
        at 
org.mortbay.io.nio.SelectChannelEndPoint.updateKey(SelectChannelEndPoint.java:324)
        at 
org.mortbay.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:278)
        at 
org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:542)
        at 
org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
        at 
org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
        at 
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
        at 
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileServer(TransferFsImage.java:127)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:49)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

2010-03-28 00:43:37,516 ERROR org.mortbay.log: /getimage
java.lang.IllegalStateException: Committed
        at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
        at org.mortbay.jetty.Response.sendError(Response.java:240)
        at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:64)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
2010-03-28 00:43:37,724 WARN 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New 
Image Size: 9244565
2010-03-28 01:43:37,747 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0 
Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 
Number of syncs: 0 SyncTimes(ms): 0 
2010-03-28 01:43:38,159 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file 
fsimage size 4521984 bytes.
2010-03-28 01:43:38,184 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits 
size 2745394 bytes.
2010-03-28 01:43:38,262 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-03-28 01:43:38,262 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-03-28 01:43:38,262 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2010-03-28 01:43:38,264 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Number of files = 64295
2010-03-28 01:43:38,613 ERROR 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in 
doCheckpoint: 
2010-03-28 01:43:38,613 ERROR 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.Block.readFields(Block.java:136)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:904)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:589)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
        at java.lang.Thread.run(Thread.java:619)

2010-03-28 01:48:38,622 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0 
Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 
Number of syncs: 0 SyncTimes(ms): 0 
2010-03-28 01:48:38,651 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Recovering storage directory /export/hadoop/dfs/namesecondary from failed 
checkpoint.
2010-03-28 01:48:38,696 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file 
fsimage size 4521984 bytes.
2010-03-28 01:48:38,721 INFO 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits 
size 2745394 bytes.
2010-03-28 01:48:38,804 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-03-28 01:48:38,805 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-03-28 01:48:38,805 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2010-03-28 01:48:38,806 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Number of files = 64295
2010-03-28 01:48:39,173 ERROR 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in 
doCheckpoint: 
2010-03-28 01:48:39,174 ERROR 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.Block.readFields(Block.java:136)
        at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:904)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:589)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
        at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
        at java.lang.Thread.run(Thread.java:619)
....
{code}

And so on.

I don't see log of NN-side of things.  We were not running with DEBUG enabled.

> SecondaryNamenode fails to checkpoint because namenode fails with 
> CancelledKeyException
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-1024
>                 URL: https://issues.apache.org/jira/browse/HDFS-1024
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.22.0
>            Reporter: dhruba borthakur
>            Assignee: Dmytro Molkov
>             Fix For: 0.22.0
>
>         Attachments: HDFS-1024.patch, HDFS-1024.patch.1
>
>
> The secondary namenode fails to retrieve the entire fsimage from the 
> Namenode. It fetches a part of the fsimage but believes that it has fetched 
> the entire fsimage file and proceeds ahead with the checkpointing. Stack 
> traces will be attached below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to