[ https://issues.apache.org/jira/browse/IOTDB-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641047#comment-17641047 ]
刘珍 commented on IOTDB-4830:
---------------------------

rel/1.0 1130_40de3ad
Private cloud, 3 replicas, 3C5D

1. Start the 3-replica 3C5D cluster.
2. Stop the DataNode on ip3.
3. Write data with the benchmark (BM); the write completes.
4. Scale in (remove) the DataNode on ip3; the removal succeeds.

ConfigNode Leader log:

2022-11-30 10:52:40,849 [ForkJoinPool.commonPool-worker-5] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 10:54:41,161 [ForkJoinPool.commonPool-worker-1] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 10:56:41,474 [ForkJoinPool.commonPool-worker-1] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 10:58:41,789 [ForkJoinPool.commonPool-worker-0] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 11:00:42,105 [ForkJoinPool.commonPool-worker-6] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 11:02:42,401 [ForkJoinPool.commonPool-worker-0] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 11:04:42,686 [0@group-000000000000-StateMachineUpdater] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 11:06:42,972 [ForkJoinPool.commonPool-worker-5] ERROR o.a.i.c.p.TriggerInfo:246 - Failed to take snapshot, because snapshot file [/data/iotdb/r_1130_40de3ad/sbin/../data/confignode/consensus/47474747-4747-4747-4747-000000000000/sm/.tmp.1_20583/trigger_info.bin] is already exist.
2022-11-30 11:11:48,561 [ProcExecWorker-2] ERROR o.a.i.c.c.s.SyncDataNodeClientPool:97 - {color:#DE350B}SET_SYSTEM_STATUS failed on DataNode TEndPoint(ip:172.20.70.3, port:9003)
java.io.IOException: Borrow client from pool for node TEndPoint(ip:172.20.70.3, port:9003) failed, you need to increase dn_max_connection_for_internal_service.{color}
    at org.apache.iotdb.commons.client.ClientManager.borrowClient(ClientManager.java:64)
    at org.apache.iotdb.confignode.client.sync.SyncDataNodeClientPool.sendSyncRequestToDataNodeWithGivenRetry(SyncDataNodeClientPool.java:87)
    at org.apache.iotdb.confignode.procedure.env.ConfigNodeProcedureEnv.markDataNodeAsRemovingAndBroadcast(ConfigNodeProcedureEnv.java:373)
    at org.apache.iotdb.confignode.procedure.impl.node.RemoveDataNodeProcedure.executeFromState(RemoveDataNodeProcedure.java:86)
    at org.apache.iotdb.confignode.procedure.impl.node.RemoveDataNodeProcedure.executeFromState(RemoveDataNodeProcedure.java:47)
    at org.apache.iotdb.confignode.procedure.impl.statemachine.StateMachineProcedure.execute(StateMachineProcedure.java:186)
    at org.apache.iotdb.confignode.procedure.Procedure.doExecute(Procedure.java:365)
    at org.apache.iotdb.confignode.procedure.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:414)
    at org.apache.iotdb.confignode.procedure.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:373)
    at org.apache.iotdb.confignode.procedure.ProcedureExecutor.access$300(ProcedureExecutor.java:50)
    at org.apache.iotdb.confignode.procedure.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:741)
Caused by: net.sf.cglib.core.CodeGenerationException: org.apache.thrift.transport.TTransportException-->java.net.ConnectException: Connection refused (Connection refused)
    at net.sf.cglib.core.ReflectUtils.newInstance(ReflectUtils.java:235)
    at net.sf.cglib.core.ReflectUtils.newInstance(ReflectUtils.java:220)
    at net.sf.cglib.proxy.Enhancer.createUsingReflection(Enhancer.java:639)
    at net.sf.cglib.proxy.Enhancer.firstInstance(Enhancer.java:538)
    at net.sf.cglib.core.AbstractClassGenerator.create(AbstractClassGenerator.java:225)
    at net.sf.cglib.proxy.Enhancer.createHelper(Enhancer.java:377)
    at net.sf.cglib.proxy.Enhancer.create(Enhancer.java:304)
    at org.apache.iotdb.commons.client.sync.SyncThriftClientWithErrorHandler.newErrorHandler(SyncThriftClientWithErrorHandler.java:48)
    at org.apache.iotdb.commons.client.sync.SyncDataNodeInternalServiceClient$Factory.makeObject(SyncDataNodeInternalServiceClient.java:127)
    at org.apache.iotdb.commons.client.sync.SyncDataNodeInternalServiceClient$Factory.makeObject(SyncDataNodeInternalServiceClient.java:105)
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:780)
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:439)
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:350)
    at org.apache.iotdb.commons.client.ClientManager.borrowClient(ClientManager.java:50)
    ... 10 common frames omitted
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:243)
    at org.apache.iotdb.rpc.TElasticFramedTransport.open(TElasticFramedTransport.java:91)
    at org.apache.iotdb.commons.client.sync.SyncDataNodeInternalServiceClient.<init>(SyncDataNodeInternalServiceClient.java:63)
    at org.apache.iotdb.commons.client.sync.SyncDataNodeInternalServiceClient$$EnhancerByCGLIB$$b73d1a05.<init>(<generated>)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at net.sf.cglib.core.ReflectUtils.newInstance(ReflectUtils.java:228)
    ... 23 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:238)
    ... 31 common frames omitted

> [SchemaRegion migrated failed] remove datanode that has stopped ,confignode executes “DELETE_OLD_REGION_PEER” on this datanode
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-4830
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4830
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: 陈哲涵
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0-SNAPSHOT
>
>         Attachments: image-2022-11-02-14-55-28-013.png, image-2022-11-15-14-35-54-026.png, image-2022-11-15-14-37-38-147.png, image-2022-11-15-15-10-58-501.png, image-2022-11-15-15-12-05-884.png, iotdb_4830.conf, screenshot-1.png
>
>
> m_1102_09e2566
> 1. Start a 3-replica, 3C5D cluster.
> 2. Stop the DataNode on ip76 normally by calling the stop-datanode.sh script.
> 3. The benchmark finishes writing data.
> 4. Scale in (remove) the offline DataNode on ip76.
> The ConfigNode retries connecting to ip76 and also retries DELETE_OLD_REGION_PEER. DELETE_OLD_REGION_PEER does not need to be executed, because it is not a retry issued after the scale-in started:
> 2022-11-02 14:34:23,637 [ProcExecWorker-9] ERROR o.a.i.c.c.s.SyncDataNodeClientPool:113 - {color:#DE350B}*DELETE_OLD_REGION_PEER*{color} failed on DataNode TEndPoint(ip:192.168.10.76, port:9003)
> 5. Start the DataNode on ip76. The remove begins executing on ip76, but the node's status at this point is Running; it should be Removing.
> ip76 DataNode log (the remove is already executing):
> 2022-11-02 14:38:45,611 [pool-53-IoTDB-Region-Migrate-Pool-1] INFO o.a.i.d.s.RegionMigrateService$DeleteOldRegionPeerTask:493 - succeed to remove region DataRegion[12] consensus group
> Cluster node status at this point:
> !image-2022-11-02-14-55-28-013.png!
> TEST ENV
> 192.168.10.72~76

--
This message was sent by Atlassian Jira (v8.20.10#820010)