[ https://issues.apache.org/jira/browse/HDDS-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883727#comment-16883727 ]
Shashikant Banerjee commented on HDDS-1753:
-------------------------------------------

The issue here is that, as a result of a key delete, a block in a closed container can get deleted on the leader while its data is still being replicated to the followers. When a follower then asks the leader for the chunk data, the request fails because the chunk file no longer exists on the leader.

The proposed solution is as follows: whenever a datanode receives a delete command from SCM, it should first check the minimum replicated index across all servers in the pipeline. The ContainerStateMachine will also track the close-container log index for each container. If the min replicated index >= the close-container index on the leader, the delete operation will be queued over Ratis on the leader (and the same will be ignored on the followers), so the delete happens over Ratis. If the close-container index has not yet been replicated, the delete transaction is never enqueued over Ratis and is ignored; SCM already has a retry policy in place to retry the same delete. If the Ratis pipeline no longer exists, the delete will work as it does today.

> Datanode unable to find chunk while replication data using ratis.
> -----------------------------------------------------------------
>
>                 Key: HDDS-1753
>                 URL: https://issues.apache.org/jira/browse/HDDS-1753
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.4.0
>            Reporter: Mukul Kumar Singh
>            Assignee: Shashikant Banerjee
>            Priority: Major
>              Labels: MiniOzoneChaosCluster
>
> The leader datanode is unable to read a chunk while replicating data from the leader to a follower.
> Please note that deletion of keys is also happening while the data is being replicated.
> {code}
> 2019-07-02 19:39:22,604 INFO impl.RaftServerImpl (RaftServerImpl.java:checkInconsistentAppendEntries(972)) - 5ac88709-a3a2-4c8f-91de-5e54b617f05e: inconsistency entries. Reply:76a3eb0f-d7cd-477b-8973-db1014feb398<-5ac88709-a3a2-4c8f-91de-5e54b617f05e#70:FAIL,INCONSISTENCY,nextIndex:9771,term:2,followerCommit:9782
> 2019-07-02 19:39:22,605 ERROR impl.ChunkManagerImpl (ChunkUtils.java:readData(161)) - Unable to find the chunk file. chunk info : ChunkInfo{chunkName='76ec669ae2cb6e10dd9f08c0789c5fdf_stream_a2850dce-def3-4d64-93d8-fa2ebafee933_chunk_1, offset=0, len=2048}
> 2019-07-02 19:39:22,605 INFO impl.RaftServerImpl (RaftServerImpl.java:checkInconsistentAppendEntries(990)) - 5ac88709-a3a2-4c8f-91de-5e54b617f05e: Failed appendEntries as latest snapshot (9770) already has the append entries (first index: 1)
> 2019-07-02 19:39:22,605 INFO impl.RaftServerImpl (RaftServerImpl.java:checkInconsistentAppendEntries(972)) - 5ac88709-a3a2-4c8f-91de-5e54b617f05e: inconsistency entries. Reply:76a3eb0f-d7cd-477b-8973-db1014feb398<-5ac88709-a3a2-4c8f-91de-5e54b617f05e#71:FAIL,INCONSISTENCY,nextIndex:9771,term:2,followerCommit:9782
> 2019-07-02 19:39:22,605 INFO keyvalue.KeyValueHandler (ContainerUtils.java:logAndReturnError(146)) - Operation: ReadChunk : Trace ID: 4216d461a4679e17:4216d461a4679e17:0:0 : Message: Unable to find the chunk file. chunk info ChunkInfo{chunkName='76ec669ae2cb6e10dd9f08c0789c5fdf_stream_a2850dce-def3-4d64-93d8-fa2ebafee933_chunk_1, offset=0, len=2048} : Result: UNABLE_TO_FIND_CHUNK
> 2019-07-02 19:39:22,605 INFO impl.RaftServerImpl (RaftServerImpl.java:checkInconsistentAppendEntries(990)) - 5ac88709-a3a2-4c8f-91de-5e54b617f05e: Failed appendEntries as latest snapshot (9770) already has the append entries (first index: 2)
> 2019-07-02 19:39:22,606 INFO impl.RaftServerImpl (RaftServerImpl.java:checkInconsistentAppendEntries(972)) - 5ac88709-a3a2-4c8f-91de-5e54b617f05e: inconsistency entries. Reply:76a3eb0f-d7cd-477b-8973-db1014feb398<-5ac88709-a3a2-4c8f-91de-5e54b617f05e#72:FAIL,INCONSISTENCY,nextIndex:9771,term:2,followerCommit:9782
> 19:39:22.606 [pool-195-thread-19] ERROR DNAudit - user=null | ip=null | op=READ_CHUNK {blockData=conID: 3 locID: 102372189549953034 bcsId: 0} | ret=FAILURE
> java.lang.Exception: Unable to find the chunk file. chunk info ChunkInfo{chunkName='76ec669ae2cb6e10dd9f08c0789c5fdf_stream_a2850dce-def3-4d64-93d8-fa2ebafee933_chunk_1, offset=0, len=2048}
>         at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:320) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:148) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:346) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.readStateMachineData(ContainerStateMachine.java:476) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$getCachedStateMachineData$2(ContainerStateMachine.java:495) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767) ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache.get(LocalCache.java:3965) ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4764) ~[guava-11.0.2.jar:?]
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.getCachedStateMachineData(ContainerStateMachine.java:494) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$readStateMachineData$4(ContainerStateMachine.java:542) ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) [?:1.8.0_171]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
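The gating rule proposed in the comment can be sketched roughly as below. This is only an illustration of the decision logic, assuming three inputs: the last replicated log index of each server in the pipeline, and the close-container log index tracked by the ContainerStateMachine; all class, enum, and method names here are hypothetical and do not come from the actual HDDS-1753 patch.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class DeleteGate {

    enum Decision {
        ENQUEUE_OVER_RATIS, // every replica has applied the container close
        IGNORE_AND_RETRY,   // close index not replicated yet; rely on SCM's retry
        DELETE_DIRECTLY     // no Ratis pipeline exists; delete works as today
    }

    /**
     * Decide what to do with an SCM block-delete command.
     *
     * @param replicatedIndices  last replicated log index of each server in
     *                           the pipeline (empty if the pipeline is gone)
     * @param closeContainerIdx  log index at which the container was closed
     */
    static Decision decide(Collection<Long> replicatedIndices, long closeContainerIdx) {
        if (replicatedIndices.isEmpty()) {
            // No Ratis pipeline: delete proceeds as before.
            return Decision.DELETE_DIRECTLY;
        }
        // The delete is safe only once the close-container entry has been
        // replicated everywhere, i.e. min replicated index >= close index.
        long minReplicatedIndex = Collections.min(replicatedIndices);
        return minReplicatedIndex >= closeContainerIdx
            ? Decision.ENQUEUE_OVER_RATIS
            : Decision.IGNORE_AND_RETRY;
    }

    public static void main(String[] args) {
        // Close entry (index 9770) replicated on all three servers.
        System.out.println(decide(List.of(9780L, 9775L, 9771L), 9770L));
        // One follower still behind the close index: ignore, SCM retries.
        System.out.println(decide(List.of(9780L, 9760L), 9770L));
        // Pipeline no longer exists.
        System.out.println(decide(List.of(), 9770L));
    }
}
```

The point of routing the delete through Ratis only after the close index is fully replicated is that a follower can then never be asked to re-read a chunk that the leader has already deleted, which is exactly the failure shown in the log above.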