adoroszlai opened a new pull request #2882:
URL: https://github.com/apache/ozone/pull/2882


   ## What changes were proposed in this pull request?
   
   In secure environments, a datanode's `addGroup` request to other datanodes
   fails with `Network closed for unknown reason`.  As a result, pipeline
   creation essentially becomes async, just like before HDDS-2679.
   
   The problem is that Ratis sends group management requests via a separate
   admin channel, but currently only the regular "client" channel is configured
   for TLS when performing the add group request.  We need to configure TLS for
   talking to the admin endpoint, too.
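
   To illustrate the failure mode described above, here is a minimal toy model
   in plain Java. It is only a sketch: `ChannelTlsSketch`, `Parameters`,
   `TlsConfig`, `clientOnly`, `clientAndAdmin` and `groupAddSucceeds` are
   hypothetical names invented for this example, not Ratis or Ozone APIs. It
   assumes that a channel with no TLS entry falls back to plaintext and that a
   secure peer rejects plaintext connections.

   ```java
   import java.util.EnumMap;
   import java.util.Map;
   import java.util.Optional;

   /** Toy model of per-channel TLS settings; NOT the Ratis API. */
   public class ChannelTlsSketch {

     /** The two kinds of channel mentioned in the description. */
     enum Channel { CLIENT, ADMIN }

     /** Hypothetical stand-in for a TLS configuration object. */
     static final class TlsConfig {
       final String trustStore;
       TlsConfig(String trustStore) { this.trustStore = trustStore; }
     }

     /** Hypothetical settings holder: a channel without an entry uses plaintext. */
     static final class Parameters {
       private final Map<Channel, TlsConfig> tls = new EnumMap<>(Channel.class);

       void setTls(Channel channel, TlsConfig conf) { tls.put(channel, conf); }

       Optional<TlsConfig> getTls(Channel channel) {
         return Optional.ofNullable(tls.get(channel));
       }
     }

     /** Models the situation before the fix: TLS only on the client channel. */
     static Parameters clientOnly(TlsConfig conf) {
       Parameters p = new Parameters();
       p.setTls(Channel.CLIENT, conf);
       return p;
     }

     /** Models the situation after the fix: TLS on both channels. */
     static Parameters clientAndAdmin(TlsConfig conf) {
       Parameters p = clientOnly(conf);
       p.setTls(Channel.ADMIN, conf);
       return p;
     }

     /**
      * Group management goes through the admin channel, so in this model the
      * request succeeds against a secure peer only if that channel has TLS.
      */
     static boolean groupAddSucceeds(Parameters p) {
       return p.getTls(Channel.ADMIN).isPresent();
     }

     public static void main(String[] args) {
       TlsConfig conf = new TlsConfig("truststore.jks"); // placeholder value
       System.out.println("client only:    " + groupAddSucceeds(clientOnly(conf)));
       System.out.println("client + admin: " + groupAddSucceeds(clientAndAdmin(conf)));
     }
   }
   ```

   In this model the request fails when only the client channel has TLS and
   succeeds once the admin channel is configured as well.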
   
   https://issues.apache.org/jira/browse/HDDS-6061
   
   ## How was this patch tested?
   
   Added temporary code to consistently reproduce the problem in the
   `ozonesecure` environment:
   
   ```
   diff --git hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java
   index 687b6be06..66709aaf6 100644
   --- hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java
   +++ hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CreatePipelineCommandHandler.java
   @@ -82,6 +82,10 @@ public void handle(SCMCommand command, OzoneContainer ozoneContainer,
        final HddsProtos.PipelineID pipelineIdProto = pipelineID.getProtobuf();
        final List<DatanodeDetails> peers = createCommand.getNodeList();
        final List<Integer> priorityList = createCommand.getPriorityList();
   +    if (dn.getHostName().contains("2")) {
   +      LOG.info("ZZZ ignore create pipeline command from SCM for {}", dn.getHostName());
   +      return;
   +    }
   
        try {
          XceiverServerSpi server = ozoneContainer.getWriteChannel();
   ```
   
   and reproduced it:
   
   ```
   datanode_2  | 2021-12-02 14:18:52,840 [Command processor thread] INFO commandhandler.CreatePipelineCommandHandler: ZZZ ignore create pipeline command from SCM for ozonesecure_datanode_2.ozonesecure_default
   ...
   datanode_3  | 2021-12-02 14:18:54,506 [Command processor thread] INFO ratis.XceiverServerRatis: Created group PipelineID=ffca2383-1bb9-4bfe-86dd-ea595e16cbb7
   datanode_3  | 2021-12-02 14:18:55,896 [Command processor thread] WARN commandhandler.CreatePipelineCommandHandler: Add group failed for e47c6feb-bd55-4d55-8d62-26a4d879dfe3{ip: 172.18.0.9, host: ozonesecure_datanode_2.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   datanode_3  | java.io.IOException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
   datanode_3  |        at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
   datanode_3  |        at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:218)
   datanode_3  |        at org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:179)
   datanode_3  |        at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:96)
   datanode_3  |        at org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:130)
   datanode_3  |        at org.apache.ratis.client.impl.GroupManagementImpl.add(GroupManagementImpl.java:51)
   datanode_3  |        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CreatePipelineCommandHandler.lambda$handle$1(CreatePipelineCommandHandler.java:103)
   ...
   datanode_2  | 2021-12-02 14:18:59,946 [grpc-default-executor-1] WARN server.GrpcServerProtocolService: e47c6feb-bd55-4d55-8d62-26a4d879dfe3: Failed requestVote 38e348be-d91a-42bf-a2c9-2534a3e6c0fb->e47c6feb-bd55-4d55-8d62-26a4d879dfe3#0
   datanode_2  | org.apache.ratis.protocol.exceptions.GroupMismatchException: e47c6feb-bd55-4d55-8d62-26a4d879dfe3: group-EA595E16CBB7 not found.
   datanode_2  |        at org.apache.ratis.server.impl.RaftServerProxy$ImplMap.get(RaftServerProxy.java:147)
   datanode_2  |        at org.apache.ratis.server.impl.RaftServerProxy.getImplFuture(RaftServerProxy.java:339)
   datanode_2  |        at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:348)
   datanode_2  |        at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:343)
   datanode_2  |        at org.apache.ratis.server.impl.RaftServerProxy.requestVote(RaftServerProxy.java:548)
   datanode_2  |        at org.apache.ratis.grpc.server.GrpcServerProtocolService.requestVote(GrpcServerProtocolService.java:172)
   ```
   
   then verified the fix:
   
   ```
   SCM is out of safe mode.
   ...
   Start freon testing                                                   | PASS |
   ```
   
   CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/1531120597


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


