[ 
https://issues.apache.org/jira/browse/IOTDB-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603547#comment-17603547
 ] 

Jinrui Zhang commented on IOTDB-4380:
-------------------------------------

[~SpriCoder] Let's confirm whether the following scenario can happen.

The safeIndex is 2958, and all entries whose index is less than 2958 have 
been deleted from the WAL files.

If yes, this bug would happen.
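
To make the scenario concrete, here is a minimal sketch in plain Java. It is not IoTDB code: the class WalGapSketch and its methods are hypothetical names, and it assumes that search indexes are assigned contiguously and in increasing order. It models a WAL whose entries below the safeIndex were deleted and checks whether a dispatcher's target index can still be served.

```java
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical model, not IoTDB code: a WAL that keeps entries keyed by search index.
final class WalGapSketch {
    private final TreeMap<Long, String> entries = new TreeMap<>();

    void append(long searchIndex, String payload) {
        entries.put(searchIndex, payload);
    }

    // Drop every entry whose index is strictly less than safeIndex.
    void deleteBelow(long safeIndex) {
        entries.headMap(safeIndex, false).clear();
    }

    // Smallest index still present that is >= target, if any.
    OptionalLong firstAvailableFrom(long target) {
        Long key = entries.ceilingKey(target);
        return key == null ? OptionalLong.empty() : OptionalLong.of(key);
    }

    // True when the target is missing although a later entry exists.
    // Under the contiguity assumption this means the target was deleted,
    // so waiting for it can never succeed.
    boolean targetIsUnreachable(long target) {
        if (entries.containsKey(target)) {
            return false;
        }
        OptionalLong next = firstAvailableFrom(target);
        return next.isPresent() && next.getAsLong() > target;
    }

    public static void main(String[] args) {
        WalGapSketch wal = new WalGapSketch();
        for (long i = 2400; i <= 2959; i++) {
            wal.append(i, "entry-" + i);
        }
        wal.deleteBelow(2958);   // safeIndex from the scenario above
        long target = 2501;      // next target index from the reported log line
        System.out.println("unreachable = " + wal.targetIsUnreachable(target));
        System.out.println("first available = " + wal.firstAvailableFrom(target));
    }
}
```

In this model a dispatcher that keeps waiting for index 2501 would time out on every attempt, which would be consistent with the repeated "timeout when waiting for next WAL entry ready" message if the scenario is confirmed.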

> The log dispatcher is inconsistent with the search index of the wal node
> ------------------------------------------------------------------------
>
>                 Key: IOTDB-4380
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4380
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Haiming Zhu
>            Priority: Major
>         Attachments: ip74_logs.tar.gz, more_metadata.conf
>
>
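> m_0908_7915b3f.
> Problem description
> {color:#DE350B}*The search index of the log dispatcher is inconsistent with 
> that of the wal node; after the datanode restarts successfully, this log line 
> is printed continuously*{color}: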
> 2022-09-09 16:32:00,011 [pool-33-IoTDB-LogDispatcher-DataRegion[12]-2] INFO  
> o.a.i.d.w.n.WALNode$PlanNodeIterator:695 - timeout when waiting for next WAL 
> entry ready, execute rollWALFile. {color:#DE350B}*Current search index in wal 
> buffer is 2959, and next target index is 2501 *{color}
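> MultiLeaderConsensus, 3 replicas, 3 nodes
> 1. While the metadata is being created, kill the datanode PID on ip74.
> The benchmark configuration file is attached.
> 2. Clear the operating system cache on ip74 and start the datanode on ip74.
> 3. Run the benchmark again with the same configuration, with IS_DELETE_DATA=true.
> When this parameter is true, delete storage group root.test.* is executed first.
> After the benchmark finishes, stop the datanode service on ip74.
> Back up data to /data/mpp_test/m_0908_7915b3f/datanode/data_for_recovery_Test
> 4. Clear the operating system cache on ip74 and start the datanode service.
> Run the benchmark again with the same configuration. After the benchmark finishes,
> check the logs on ip74, which show: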
> 2022-09-09 15:43:13,691 [pool-23-IoTDB-MPPDataExchangeRPC-Processor-40] ERROR 
> o.a.t.ProcessFunction:47 - Internal error processing getDataBlock
> org.apache.thrift.TException: Source fragment instance not found. Fragment 
> instance ID: TFragmentInstanceId(queryId:20220909_074205_19400_3, 
> fragmentId:2, instanceId:0).
>         at 
> org.apache.iotdb.db.mpp.execution.exchange.MPPDataExchangeManager$MPPDataExchangeServiceImpl.getDataBlock(MPPDataExchangeManager.java:90)
>         at 
> org.apache.iotdb.mpp.rpc.thrift.MPPDataExchangeService$Processor$getDataBlock.getResult(MPPDataExchangeService.java:326)
>         at 
> org.apache.iotdb.mpp.rpc.thrift.MPPDataExchangeService$Processor$getDataBlock.getResult(MPPDataExchangeService.java:306)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
>         at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> 2022-09-09 15:43:15,312 [20220909_074205_19400_3.2.0.SinkHandle-3074] ERROR 
> o.a.i.d.m.e.e.SinkHandle:281 - The TsBlock doesn't exist. Sequence ID is 1, 
> remaining map is 
> [0=<org.apache.iotdb.tsfile.read.common.block.TsBlock@5f617979,1048576>]
> 2022-09-09 15:43:17,119 [pool-23-IoTDB-MPPDataExchangeRPC-Processor-22] ERROR 
> o.a.t.ProcessFunction:47 - Internal error processing getDataBlock
> java.lang.IllegalStateException: The data block doesn't exist. Sequence ID: 1
>         at 
> org.apache.iotdb.db.mpp.execution.exchange.SinkHandle.getSerializedTsBlock(SinkHandle.java:285)
>         at 
> org.apache.iotdb.db.mpp.execution.exchange.MPPDataExchangeManager$MPPDataExchangeServiceImpl.getDataBlock(MPPDataExchangeManager.java:97)
>         at 
> org.apache.iotdb.mpp.rpc.thrift.MPPDataExchangeService$Processor$getDataBlock.getResult(MPPDataExchangeService.java:326)
>         at 
> org.apache.iotdb.mpp.rpc.thrift.MPPDataExchangeService$Processor$getDataBlock.getResult(MPPDataExchangeService.java:306)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
>         at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> 5. Stop the datanode service on ip74.
> Back up data to /data/mpp_test/m_0908_7915b3f/datanode/data_for_recovery_Test_2
> Clear the operating system cache on ip74 and start the datanode on ip74; it fails:
> 2022-09-09 16:44:00,039 [pool-33-IoTDB-LogDispatcher-DataRegion[12]-2] INFO  
> o.a.i.d.w.n.WALNode$PlanNodeIterator:695 - timeout when waiting for next WAL 
> entry ready, execute rollWALFile. Current search index in wal buffer is 2959, 
> and next target index is 2501 
> Machine and cluster configuration
> 1. 192.168.10.72/73/74: 48 cores, 384 GB
> The benchmark runs on 71.
> 2. Cluster parameters
> confignode
> MAX_HEAP_SIZE="8G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> schema_replication_factor=3
> data_replication_factor=3
> datanode
> MAX_HEAP_SIZE="256G"
> MAX_DIRECT_MEMORY_SIZE="32G"
> max_connection_for_internal_service=300
> enable_timed_flush_seq_memtable=true
> seq_memtable_flush_interval_in_ms=600000
> seq_memtable_flush_check_interval_in_ms=300000
> enable_timed_flush_unseq_memtable=true
> unseq_memtable_flush_interval_in_ms=600000
> unseq_memtable_flush_check_interval_in_ms=300000
> max_waiting_time_when_insert_blocked=3600000
> query_timeout_threshold=3600000



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
