[ https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375299#comment-16375299 ]
He Xiaoqiao commented on HDFS-12749:
------------------------------------

[~xkrogen] Thanks for your review and comments.
{quote}Are you running without Kerberos? Do you see a relevant WARN statement from the log for ipc.Client?{quote}
a. This is a secure cluster with Kerberos enabled.
b. All relevant WARN and exception messages are as [~tanyuxin] reported above (in the description and the second comment).

Based on the exception logs that [~tanyuxin] provided, I believe the {{SocketTimeoutException}} was over-wrapped in an extra {{IOException}} by {{Client#cleanupCalls}}. The notes below are based on branch-2.7 (maybe I am wrong; if so, please correct me):

a. {{Client#call}} (line 1448) throws the {{IOException}} that wraps the {{SocketTimeoutException}}: when {{call.error}} is non-null and is not an instance of {{RemoteException}}, the error is wrapped by {{NetUtils#wrapException}}:
{code:java}
public Writable call(RPC.RpcKind rpcKind, Writable rpcRequest,
    ConnectionId remoteId, int serviceClass,
    AtomicBoolean fallbackToSimpleAuth) throws IOException {
  final Call call = createCall(rpcKind, rpcRequest);
  Connection connection = getConnection(remoteId, call, serviceClass,
      fallbackToSimpleAuth);
  try {
    connection.sendRpcRequest(call);                 // send the rpc request
  } catch (RejectedExecutionException e) {
    throw new IOException("connection has been closed", e);
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    LOG.warn("interrupted waiting to send rpc request to server", e);
    throw new IOException(e);
  }

  synchronized (call) {
    while (!call.done) {
      try {
        call.wait();                           // wait for the result
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new InterruptedIOException("Call interrupted");
      }
    }

    if (call.error != null) {
      if (call.error instanceof RemoteException) {
        call.error.fillInStackTrace();
        throw call.error;
      } else { // local exception
        InetSocketAddress address = connection.getRemoteAddress();
        throw NetUtils.wrapException(address.getHostName(),
                address.getPort(),
                NetUtils.getHostname(),
                0,
                call.error);
      }
    } else {
      return call.getRpcResponse();
    }
  }
}
{code}
b. {{NetUtils#wrapException}} would produce a {{SocketTimeoutException}} if {{call.error}} were an instance of it, but here it is not, so the log shows {{Failed on local exception: java.io.IOException: ...}} instead.
c. {{call.error}} is set only by {{Call#setException}}, which is invoked by {{Client#receiveRpcResponse}} and {{Client#cleanupCalls}}; however, {{Client#receiveRpcResponse}} always sets a {{RemoteException}}. Evidently, the only possibility is that the {{SocketTimeoutException}} was over-wrapped in an {{IOException}} by {{Client#cleanupCalls}}.
d. The key point in {{Client#cleanupCalls}} is {{closeException}}, which is set by {{Client#markClosed}}; {{markClosed}} is invoked from {{Client#sendRpcRequest}}, which catches every {{IOException}} and passes it along as {{closeException}}:
{code:java}
public void sendRpcRequest(final Call call)
    throws InterruptedException, IOException {
  ......
  synchronized (sendRpcRequestLock) {
    Future<?> senderFuture = sendParamsExecutor.submit(new Runnable() {
      @Override
      public void run() {
        try {
          ......
        } catch (IOException e) {
          // exception at this point would leave the connection in an
          // unrecoverable state (eg half a call left on the wire).
          // So, close the connection, killing any outstanding calls
          markClosed(e);
        } finally {
          // the buffer is just an in-memory buffer, but it is still polite to
          // close early
          IOUtils.closeStream(d);
        }
      }
    });
    .....
  }
}
{code}
[~xkrogen], [~kihwal] any suggestions?
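The over-wrapping described in (b)–(d) can be reproduced in isolation. This is a minimal sketch (not Hadoop code; the class and helper names are hypothetical): once a {{SocketTimeoutException}} is nested inside a plain {{IOException}}, neither an {{instanceof}} dispatch like the one in {{NetUtils#wrapException}} nor a dedicated catch clause like the one in {{BPServiceActor#register}} will recognize it — only {{getCause()}} still reveals the original type:
{code:java}
import java.io.IOException;
import java.net.SocketTimeoutException;

public class OverWrapDemo {
  // Simulates the effect of Client#markClosed: the original timeout is
  // captured as a plain IOException (closeException), hiding its concrete type.
  static IOException overWrap(Exception cause) {
    return new IOException(cause);
  }

  public static void main(String[] args) {
    SocketTimeoutException timeout =
        new SocketTimeoutException("60000 millis timeout while waiting ...");
    IOException wrapped = overWrap(timeout);

    // An instanceof dispatch can no longer recognize the timeout:
    System.out.println("is SocketTimeoutException: "
        + (wrapped instanceof SocketTimeoutException));   // false

    // A catch clause ordered like register()'s misses it too:
    String caughtBy;
    try {
      throw wrapped;
    } catch (SocketTimeoutException e) { // never reached for the wrapper
      caughtBy = "SocketTimeoutException";
    } catch (IOException e) {            // the wrapper lands here instead
      caughtBy = "IOException";
    }
    System.out.println("caught as: " + caughtBy);         // IOException

    // The original type survives only as the cause:
    System.out.println("cause is timeout: "
        + (wrapped.getCause() instanceof SocketTimeoutException)); // true
  }
}
{code}
This suggests why the DN falls out of {{register()}}'s retry loop: the wrapped exception no longer matches {{catch (SocketTimeoutException e)}}.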
> DN may not send block report to NN after NN restart
> ---------------------------------------------------
>
>                 Key: HDFS-12749
>                 URL: https://issues.apache.org/jira/browse/HDFS-12749
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>            Reporter: TanYuxin
>            Priority: Major
>         Attachments: HDFS-12749-branch-2.7.002.patch, HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
> Our cluster now has thousands of DNs and millions of files and blocks. When the NN restarts, its load is very high.
> After the NN restart, a DN calls BPServiceActor#reRegister to register. But the register RPC gets an IOException since the NN is busy dealing with block reports. The exception is caught at BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local host is: "DataNode_Host/Datanode_IP"; destination host is: "NameNode_Host":Port;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1474)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, so the block report cannot be sent immediately.
> {code}
>   /**
>    * Register one bp with the corresponding NameNode
>    * <p>
>    * The bpDatanode needs to register with the namenode on startup in order
>    * 1) to report which storage it is serving now and
>    * 2) to receive a registrationID
>    * issued by the namenode to recognize registered datanodes.
>    *
>    * @param nsInfo current NamespaceInfo
>    * @see FSNamesystem#registerDatanode(DatanodeRegistration)
>    * @throws IOException
>    */
>   void register(NamespaceInfo nsInfo) throws IOException {
>     // The handshake() phase loaded the block pool storage
>     // off disk - so update the bpRegistration object from that info
>     DatanodeRegistration newBpRegistration = bpos.createRegistration();
>     LOG.info(this + " beginning handshake with NN");
>     while (shouldRun()) {
>       try {
>         // Use returned registration from namenode with updated fields
>         newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
>         newBpRegistration.setNamespaceInfo(nsInfo);
>         bpRegistration = newBpRegistration;
>         break;
>       } catch(EOFException e) {  // namenode might have just restarted
>         LOG.info("Problem connecting to server: " + nnAddr + " :"
>             + e.getLocalizedMessage());
>         sleepAndLogInterrupts(1000, "connecting to server");
>       } catch(SocketTimeoutException e) {  // namenode is busy
>         LOG.info("Problem connecting to server: " + nnAddr);
>         sleepAndLogInterrupts(1000, "connecting to server");
>       }
>     }
>
>     LOG.info("Block pool " + this + " successfully registered with NN");
>     bpos.registrationSucceeded(this, bpRegistration);
>     // random short delay - helps scatter the BR from all DNs
>     scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But the NameNode has processed registerDatanode successfully, so it won't ask the DN to re-register again.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)