[ 
https://issues.apache.org/jira/browse/IOTDB-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

刘珍 reopened IOTDB-3862:
-----------------------

依赖最新ratis打包,基于master ba7bd5d0fd1a97807549bdb579c2557a3fbabbd9
问题还存在:
1. 192.168.130.3/4/5  3 副本3节点
bm在ip2 启动运行
2. 创建元数据成功,开始写入数据,扩容datanode  ip1

3. ip5上分布有2个dataregion的leader(共3个dataregion)
缩容ip5,ip5的error log,{color:#DE350B}缩容后还会往ip5发查询语句?{color}
2022-09-13 16:44:33,412 
[192.168.130.5_40010@group-000100000005-SegmentedRaftLogWorker-flush-thread1] 
ERROR o.a.r.s.r.s.BufferedWriteChannel:115 - Failed to flush channel
java.nio.channels.ClosedChannelException: null
        at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
        at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:379)
        at 
org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.fileChannelForce(BufferedWriteChannel.java:113)
        at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2022-09-13 16:44:57,375 [20220913_084457_65466_5.1.0-142] ERROR 
o.a.i.d.m.p.s.FragmentInstanceDispatcherImpl:166 - Execute FragmentInstance 
failed: The consensus group SchemaRegion[1] doesn't exist 
2022-09-13 16:44:57,597 [pool-22-IoTDB-ClientRPC-Processor-37] WARN  
o.a.i.d.u.ErrorHandlingUtils:62 -{color:#DE350B} Status code: 
INTERNAL_SERVER_ERROR(500), operation: "SELECT last s_50 FROM 
root.test.g2_0.d_8". executeStatement failed {color}

2022-09-13 16:45:30,439 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR 
o.a.i.d.s.t.i.DataNodeInternalRPCServiceImpl:628 - {color:#DE350B}*change 
region DataRegion[3] leader failed
org.apache.iotdb.consensus.exception.RatisRequestFailedException: Ratis request 
failed*{color}
        at 
org.apache.iotdb.consensus.ratis.RatisConsensus.transferLeader(RatisConsensus.java:526)
        at 
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.transferLeader(DataNodeInternalRPCServiceImpl.java:619)
        at 
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.changeRegionLeader(DataNodeInternalRPCServiceImpl.java:612)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$changeRegionLeader.getResult(IDataNodeRPCService.java:2643)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$changeRegionLeader.getResult(IDataNodeRPCService.java:2623)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
        at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
        at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:233)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.setConfiguration(GrpcClientProtocolClient.java:199)
        at 
org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:102)
        at 
org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:139)
        at 
org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:104)
        at 
org.apache.ratis.client.impl.AdminImpl.setConfiguration(AdminImpl.java:46)
        at 
org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:51)
        at 
org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:40)
        at 
org.apache.iotdb.consensus.ratis.RatisConsensus.transferLeader(RatisConsensus.java:516)
        ... 10 common frames omitted
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
        at 
org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.setConfiguration(AdminProtocolServiceGrpc.java:490)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$setConfiguration$6(GrpcClientProtocolClient.java:201)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:231)
        ... 18 common frames omitted
Caused by: 
org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: Connection refused: /192.168.130.3:40010
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection 
refused
        at 
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
        at 
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
        at 
org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:477)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:385)
        at 
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
        at 
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)


2022-09-13 16:45:47,956 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR 
o.a.i.c.s.ThriftService:168 - IoTDB: {color:#DE350B}*close MPP Data exchange 
manager failed because:
java.lang.InterruptedException: null*{color}
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at 
org.apache.iotdb.commons.service.ThriftService.stopService(ThriftService.java:163)
        at 
org.apache.iotdb.commons.service.ThriftService.stop(ThriftService.java:74)
        at 
org.apache.iotdb.db.mpp.execution.exchange.MPPDataExchangeService.stop(MPPDataExchangeService.java:130)
        at 
org.apache.iotdb.commons.service.IService.waitAndStop(IService.java:36)
        at 
org.apache.iotdb.commons.service.RegisterManager.deregisterAll(RegisterManager.java:60)
        at org.apache.iotdb.db.service.DataNode.deactivate(DataNode.java:454)
        at org.apache.iotdb.db.service.DataNode.stop(DataNode.java:409)
        at 
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.stopDataNode(DataNodeInternalRPCServiceImpl.java:859)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2793)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2773)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
        at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2022-09-13 16:45:47,999 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR 
o.a.i.d.r.AbstractPoolManager:49 - Interrupted while waiting flush sub task 
thread pool to exit.
java.lang.InterruptedException: null
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
        at 
java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at 
org.apache.iotdb.db.rescon.AbstractPoolManager.close(AbstractPoolManager.java:44)
        at 
org.apache.iotdb.db.rescon.AbstractPoolManager.stop(AbstractPoolManager.java:96)
        at 
org.apache.iotdb.db.engine.flush.pool.FlushSubTaskPoolManager.stop(FlushSubTaskPoolManager.java:63)
        at 
org.apache.iotdb.db.engine.flush.FlushManager.stop(FlushManager.java:63)
        at 
org.apache.iotdb.commons.service.IService.waitAndStop(IService.java:36)
        at 
org.apache.iotdb.commons.service.RegisterManager.deregisterAll(RegisterManager.java:60)
        at org.apache.iotdb.db.service.DataNode.deactivate(DataNode.java:454)
        at org.apache.iotdb.db.service.DataNode.stop(DataNode.java:409)
        at 
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.stopDataNode(DataNodeInternalRPCServiceImpl.java:859)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2793)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2773)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
        at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2022-09-13 16:45:48,026 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR 
o.a.ratis.grpc.GrpcUtil:219 - Unexpected exception while waiting for channel 
termination
java.lang.InterruptedException: null
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl.awaitTermination(ManagedChannelImpl.java:905)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ForwardingManagedChannel.awaitTermination(ForwardingManagedChannel.java:57)
        at 
org.apache.ratis.grpc.GrpcUtil.shutdownManagedChannel(GrpcUtil.java:215)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.close(GrpcClientProtocolClient.java:172)
        at org.apache.ratis.util.PeerProxyMap.closeProxy(PeerProxyMap.java:175)
        at 
org.apache.ratis.util.PeerProxyMap.lambda$null$3(PeerProxyMap.java:169)
        at java.util.Optional.ifPresent(Optional.java:159)
        at 
org.apache.ratis.util.PeerProxyMap.lambda$close$4(PeerProxyMap.java:169)
        at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at 
java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3566)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at 
java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1881)
        at 
java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2478)
        at 
java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
        at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
        at 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
        at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
        at org.apache.ratis.util.PeerProxyMap.close(PeerProxyMap.java:168)
        at 
org.apache.ratis.client.impl.RaftClientRpcWithProxy.close(RaftClientRpcWithProxy.java:53)
        at 
org.apache.ratis.client.impl.RaftClientImpl.close(RaftClientImpl.java:406)
        at 
org.apache.iotdb.consensus.ratis.RatisClient.close(RatisClient.java:62)
        at 
org.apache.iotdb.consensus.ratis.RatisClient$Factory.destroyObject(RatisClient.java:91)
        at 
org.apache.iotdb.consensus.ratis.RatisClient$Factory.destroyObject(RatisClient.java:74)
        at 
org.apache.commons.pool2.KeyedPooledObjectFactory.destroyObject(KeyedPooledObjectFactory.java:132)
        at 
org.apache.commons.pool2.impl.GenericKeyedObjectPool.destroy(GenericKeyedObjectPool.java:875)
        at 
org.apache.commons.pool2.impl.GenericKeyedObjectPool.clear(GenericKeyedObjectPool.java:607)
        at 
java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
        at 
org.apache.commons.pool2.impl.GenericKeyedObjectPool.clear(GenericKeyedObjectPool.java:582)
        at 
org.apache.commons.pool2.impl.GenericKeyedObjectPool.close(GenericKeyedObjectPool.java:692)
        at 
org.apache.iotdb.commons.client.ClientManager.close(ClientManager.java:103)
        at 
org.apache.iotdb.consensus.ratis.RatisConsensus.stop(RatisConsensus.java:153)
        at org.apache.iotdb.db.service.DataNode.stop(DataNode.java:413)
        at 
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.stopDataNode(DataNodeInternalRPCServiceImpl.java:859)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2793)
        at 
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2773)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
        at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


> [shrink + Region-Migrate ] [pool-16-IoTDB-Region-Migrate-Pool-1] ERROR 
> o.a.i.d.s.RegionMigrateService$RegionMigrateTask:349 - add new peer 
> TEndPoint(ip:a.b.c.d, port:40010) for region DataRegion[x] failed
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-3862
>                 URL: https://issues.apache.org/jira/browse/IOTDB-3862
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Song Ziyang
>            Priority: Major
>         Attachments: expand-ip1_datanode_logs.tar.gz, ip3.sh, 
> ip3_config.properties, ip4.sh, ip4_config.properties, ip5.sh, 
> ip5_config.properties, shrink-ip3_confignode_log_all.tar.gz, 
> shrink-ip3_datanode_log_all.tar.gz
>
>
> master_0718_967cde6
> RatisConsensus ,3 副本3C3D
> 先启动3C3D集群(ip3/4/5),扩容1个datanode(ip1), 在ip5执行缩容ip3的操作。
> ip3的datanode有error日志:
> 2022-07-18 {color:red}*14:00:00*{color},830 
> [pool-16-IoTDB-Region-Migrate-Pool-1] ERROR 
> o.a.i.d.s.RegionMigrateService$RegionMigrateTask:349 - add new peer 
> TEndPoint(ip:192.168.130.1, port:40010){color:red}* for region DataRegion[5] 
> failed*{color}
> 扩容节点ip1 有warn
> 2022-07-18 {color:red}*13:57:06*{color},554 [grpc-default-executor-1] WARN  
> o.a.ratis.util.LogUtils:122 - 192.168.130.1_40010: 
> {color:red}*installSnapshot onError,*{color} lastRequest: 
> 192.168.130.4_40010->192.168.130.1_40010#20-t1,previous=(t:1, 
> i:1790),leaderCommit=1893,initializing? false,entries: size=103, first=(t:1, 
> i:1791), STATEMACHINELOGENTRY, 942@client-A8FC5FC601AC: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
> cancelled
> 缩容失败,可以连缩容datanode执行操作。
> 复现流程:
> 1. 启动3C3D 192.168.130.3/4/5  16C32G * 3
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> schema_replication_factor=3
> data_replication_factor=3
> MAX_HEAP_SIZE="16G"
> 2. 启动3个benchmark分别连3个datanode
> benchmark在192.168.130.2
> /home/benchmark/bm_0620_7ec96c1
> 配置文件见附件  , 启动命令
> nohup sh -x ip3.sh > 3.log &
> nohup sh -x ip4.sh > 4.log &
> nohup sh -x ip5.sh > 5.log &
> 运行约10分钟,ip5执行缩容操作,缩容ip为192.168.130.3 :
> 2022-07-18 *13:56:40*,902 [main] INFO  o.a.i.db.service.DataNode:201 - Remove 
> result TDataNodeRemoveResp(status:TSStatus(code:200, message:Server accept 
> the request)) 
> 3. iotdb的日志见附件



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to