[ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Xiaoqiao updated HDFS-12749: ------------------------------- Attachment: HDFS-12749-trunk.004.patch > DN may not send block report to NN after NN restart > --------------------------------------------------- > > Key: HDFS-12749 > URL: https://issues.apache.org/jira/browse/HDFS-12749 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1 > Reporter: TanYuxin > Assignee: He Xiaoqiao > Priority: Major > Attachments: HDFS-12749-branch-2.7.002.patch, > HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, HDFS-12749.001.patch > > > Now our cluster have thousands of DN, millions of files and blocks. When NN > restart, NN's load is very high. > After NN restart,DN will call BPServiceActor#reRegister method to register. > But register RPC will get a IOException since NN is busy dealing with Block > Report. The exception is caught at BPServiceActor#processCommand. > Next is the caught IOException: > {code:java} > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing > datanode Command > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 60000 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local > host is: "DataNode_Host/Datanode_IP"; destination host is: > "NameNode_Host":Port; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773) > at org.apache.hadoop.ipc.Client.call(Client.java:1474) > at org.apache.hadoop.ipc.Client.call(Client.java:1407) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864) > at java.lang.Thread.run(Thread.java:745) > {code} > The un-catched IOException breaks BPServiceActor#register, and the Block > Report can not be sent immediately. > {code} > /** > * Register one bp with the corresponding NameNode > * <p> > * The bpDatanode needs to register with the namenode on startup in order > * 1) to report which storage it is serving now and > * 2) to receive a registrationID > * > * issued by the namenode to recognize registered datanodes. > * > * @param nsInfo current NamespaceInfo > * @see FSNamesystem#registerDatanode(DatanodeRegistration) > * @throws IOException > */ > void register(NamespaceInfo nsInfo) throws IOException { > // The handshake() phase loaded the block pool storage > // off disk - so update the bpRegistration object from that info > DatanodeRegistration newBpRegistration = bpos.createRegistration(); > LOG.info(this + " beginning handshake with NN"); > while (shouldRun()) { > try { > // Use returned registration from namenode with updated fields > newBpRegistration = bpNamenode.registerDatanode(newBpRegistration); > newBpRegistration.setNamespaceInfo(nsInfo); > bpRegistration = newBpRegistration; > break; > } catch(EOFException e) { // namenode might have just restarted > LOG.info("Problem connecting to server: " + nnAddr + " :" > + e.getLocalizedMessage()); > sleepAndLogInterrupts(1000, "connecting to server"); > } catch(SocketTimeoutException e) { // namenode is busy > LOG.info("Problem connecting to server: " + nnAddr); > sleepAndLogInterrupts(1000, "connecting to server"); > } > } > > LOG.info("Block pool " + this + " successfully registered with NN"); > bpos.registrationSucceeded(this, bpRegistration); > // random short delay - helps scatter the BR from all DNs > scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay); > } > {code} > But NameNode has processed registerDatanode successfully, so it won't ask DN > to re-register again -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org