adoroszlai opened a new pull request #2882: URL: https://github.com/apache/ozone/pull/2882
## What changes were proposed in this pull request?

In secure environments, a datanode's `addGroup` request to other datanodes fails with `Network closed for unknown reason`. Thus pipeline creation essentially becomes async, just like before HDDS-2679.

The problem is that Ratis sends group management requests via a separate admin channel, but currently only the real "client" channel is configured for TLS when performing the add group request. We need to configure TLS for talking to the admin endpoint, too.

https://issues.apache.org/jira/browse/HDDS-6061

## How was this patch tested?

Added temporary code to consistently reproduce the problem in the `ozonesecure` env:

```
diff --git hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java
index 687b6be06..66709aaf6 100644
--- hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java
+++ hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java
@@ -82,6 +82,10 @@ public void handle(SCMCommand command, OzoneContainer ozoneContainer,
     final HddsProtos.PipelineID pipelineIdProto = pipelineID.getProtobuf();
     final List<DatanodeDetails> peers = createCommand.getNodeList();
     final List<Integer> priorityList = createCommand.getPriorityList();
+    if (dn.getHostName().contains("2")) {
+      LOG.info("ZZZ ignore create pipeline command from SCM for {}", dn.getHostName());
+      return;
+    }
 
     try {
       XceiverServerSpi server = ozoneContainer.getWriteChannel();
```

and reproduced it:

```
datanode_2 | 2021-12-02 14:18:52,840 [Command processor thread] INFO commandhandler.CreatePipelineCommandHandler: ZZZ ignore create pipeline command from SCM for ozonesecure_datanode_2.ozonesecure_default
...
datanode_3 | 2021-12-02 14:18:54,506 [Command processor thread] INFO ratis.XceiverServerRatis: Created group PipelineID=ffca2383-1bb9-4bfe-86dd-ea595e16cbb7
datanode_3 | 2021-12-02 14:18:55,896 [Command processor thread] WARN commandhandler.CreatePipelineCommandHandler: Add group failed for e47c6feb-bd55-4d55-8d62-26a4d879dfe3{ip: 172.18.0.9, host: ozonesecure_datanode_2.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
datanode_3 | java.io.IOException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
datanode_3 | at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
datanode_3 | at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:218)
datanode_3 | at org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:179)
datanode_3 | at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:96)
datanode_3 | at org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:130)
datanode_3 | at org.apache.ratis.client.impl.GroupManagementImpl.add(GroupManagementImpl.java:51)
datanode_3 | at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CreatePipelineCommandHandler.lambda$handle$1(CreatePipelineCommandHandler.java:103)
...
datanode_2 | 2021-12-02 14:18:59,946 [grpc-default-executor-1] WARN server.GrpcServerProtocolService: e47c6feb-bd55-4d55-8d62-26a4d879dfe3: Failed requestVote 38e348be-d91a-42bf-a2c9-2534a3e6c0fb->e47c6feb-bd55-4d55-8d62-26a4d879dfe3#0
datanode_2 | org.apache.ratis.protocol.exceptions.GroupMismatchException: e47c6feb-bd55-4d55-8d62-26a4d879dfe3: group-EA595E16CBB7 not found.
datanode_2 | at org.apache.ratis.server.impl.RaftServerProxy$ImplMap.get(RaftServerProxy.java:147)
datanode_2 | at org.apache.ratis.server.impl.RaftServerProxy.getImplFuture(RaftServerProxy.java:339)
datanode_2 | at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:348)
datanode_2 | at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:343)
datanode_2 | at org.apache.ratis.server.impl.RaftServerProxy.requestVote(RaftServerProxy.java:548)
datanode_2 | at org.apache.ratis.grpc.server.GrpcServerProtocolService.requestVote(GrpcServerProtocolService.java:172)
```

then verified the fix:

```
SCM is out of safe mode.
...
Start freon testing | PASS |
```

CI: https://github.com/adoroszlai/hadoop-ozone/actions/runs/1531120597
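
For readers unfamiliar with the client/admin channel split described in this pull request, the failure mode can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Channel`, `TlsConfig`, `ClientParameters`, `channelFor` and `send` are hypothetical names invented for illustration and are not Ratis or Ozone classes, and the rule "a request without TLS for its channel is rejected" is a simplification of what a TLS-only endpoint does.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy model only: none of these types are Ratis or Ozone classes. */
public class AdminChannelTlsSketch {

  /** The two client-side channels discussed in the description. */
  enum Channel { CLIENT, ADMIN }

  /** Hypothetical stand-in for a TLS configuration object. */
  static final class TlsConfig {
    final String trustStore;

    TlsConfig(String trustStore) {
      this.trustStore = trustStore;
    }
  }

  /** Hypothetical per-channel client parameters. */
  static final class ClientParameters {
    private final Map<Channel, TlsConfig> tls = new EnumMap<>(Channel.class);

    ClientParameters withTls(Channel channel, TlsConfig conf) {
      tls.put(channel, conf);
      return this;
    }

    boolean hasTls(Channel channel) {
      return tls.containsKey(channel);
    }
  }

  /** Assumption: group management requests use ADMIN, all others use CLIENT. */
  static Channel channelFor(String request) {
    return request.startsWith("group") ? Channel.ADMIN : Channel.CLIENT;
  }

  /** Simplification: a secure endpoint rejects a channel that has no TLS config. */
  static String send(String request, ClientParameters params) {
    return params.hasTls(channelFor(request)) ? "OK" : "UNAVAILABLE";
  }

  public static void main(String[] args) {
    TlsConfig conf = new TlsConfig("hypothetical-truststore");

    // Before the fix: only the "client" channel is configured for TLS.
    ClientParameters before = new ClientParameters()
        .withTls(Channel.CLIENT, conf);

    // After the fix: the admin channel gets the same TLS configuration.
    ClientParameters after = new ClientParameters()
        .withTls(Channel.CLIENT, conf)
        .withTls(Channel.ADMIN, conf);

    System.out.println("before, write:    " + send("write", before));
    System.out.println("before, groupAdd: " + send("groupAdd", before));
    System.out.println("after,  groupAdd: " + send("groupAdd", after));
  }
}
```

In this model an ordinary request succeeds in both configurations, while `groupAdd` is rejected until the admin channel is also given a TLS configuration, which mirrors the shape of the fix proposed here.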
