刘珍 created IOTDB-4553:
-------------------------

             Summary: [remove datanode ] SchemaRegion migration failed
                 Key: IOTDB-4553
                 URL: https://issues.apache.org/jira/browse/IOTDB-4553
             Project: Apache IoTDB
          Issue Type: Bug
          Components: mpp-cluster
    Affects Versions: 0.14.0-SNAPSHOT
            Reporter: 刘珍
            Assignee: Song Ziyang
         Attachments: image-2022-09-28-18-03-13-622.png

master_0928_e5cc456
SchemaRegion : ratis
DataRegion : multiLeader
Both use 3 replicas. Start 3C3D (3 ConfigNodes and 3 DataNodes) first, write data with bm (benchmark), add one DataNode (ip40), then remove ip39.
After ip39 was removed successfully, the SchemaRegion migration failed.

!image-2022-09-28-18-03-13-622.png!

DataNode error on ip40:

2022-09-28 17:37:55,449 [pool-21-IoTDB-DataNodeInternalRPC-Processor-3] ERROR o.a.i.d.s.t.i.DataNodeInternalRPCServiceImpl:1002 - CreateNewRegionPeer error, peers: [Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.37, port:50010)}, Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.38, port:50010)}, Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.39, port:50010)}, Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.40, port:50010)}], regionId: SchemaRegion[0], errorMessage
org.apache.iotdb.consensus.exception.RatisRequestFailedException: Ratis request failed
    at org.apache.iotdb.consensus.ratis.RatisConsensus.createPeer(RatisConsensus.java:332)
    at org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.createNewRegionPeer(DataNodeInternalRPCServiceImpl.java:999)
    at org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.createNewRegionPeer(DataNodeInternalRPCServiceImpl.java:838)
    at org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createNewRegionPeer.getResult(IDataNodeRPCService.java:3237)
    at org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createNewRegionPeer.getResult(IDataNodeRPCService.java:3217)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:234)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:181)
    at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:98)
    at org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:132)
    at org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:98)
    at org.apache.ratis.client.impl.GroupManagementImpl.add(GroupManagementImpl.java:51)
    at org.apache.iotdb.consensus.ratis.RatisConsensus.createPeer(RatisConsensus.java:327)
    ... 10 common frames omitted
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:507)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$5(GrpcClientProtocolClient.java:183)
    at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:232)
    ... 16 common frames omitted
Caused by: org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /172.20.70.40:50010
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
    at org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
    at org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
    at org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
    at org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
    at org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
    at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

Test environment
1. Private cloud, 172.20.70.34..40, 8 CPU, 32 GB
34, 35, 36 are ConfigNodes
37..40 are DataNodes
The benchmark runs on ip21
2. Cluster configuration parameters
ConfigNode
MAX_HEAP_SIZE="8G"
MAX_DIRECT_MEMORY_SIZE="4G"
schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
time_partition_interval_for_routing=86400000
schema_replication_factor=3
schema_replication_factor=3

DataNode
MAX_HEAP_SIZE="20G"
MAX_DIRECT_MEMORY_SIZE="6G"
wal_buffer_size_in_byte=1048576
enable_timed_flush_seq_memtable=true
seq_memtable_flush_interval_in_ms=3600000
seq_memtable_flush_check_interval_in_ms=600000
enable_timed_flush_unseq_memtable=true
unseq_memtable_flush_interval_in_ms=3600000
unseq_memtable_flush_check_interval_in_ms=600000
query_timeout_threshold=36000000

Start the 3 ConfigNodes first (34, 35, 36), then the 3 DataNodes (37, 38, 39).
2. bm configuration: see the attachment
3. Start the DataNode on ip40
4. After bm has run for about 30 minutes, remove ip39
5. Check the result of the removal
Logs: see the attachment



--
This message was sent by Atlassian Jira
(v8.20.10#820010)