[ 
https://issues.apache.org/jira/browse/HDDS-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883727#comment-16883727
 ] 

Shashikant Banerjee commented on HDDS-1753:
-------------------------------------------

The issue being caused here is as data is still to be replicated to the 
followers via leader, as a result of key delete , a block in a closed container 
can get deleted on the leader. When the follower asks for the chunk data from 
the leader, it fails as the chunk file does not exist in the leader.

The solution being proposed here is as follows:

Whenever a delete command gets received on a datanode from SCM, it should first 
check the min replicated index across all the servers in the pipeline. 
ContainerStateMachine will also track, the close container log index for each 
cotainer. Now, if the min replicated index >= close container index in the 
leader, a delete operation will be queued over Ratis in the leader and same 
will be ignored in the follower and now delete will happen over Ratis. In case, 
close container index is not replicated, delete transaction will never be 
enqueued over Ratis and ignored. SCM already has a retry policy in place to 
retry the same delete.

In case, the Ratis pipeline does not exist, delete will work as is.

> Datanode unable to find chunk while replication data using ratis.
> -----------------------------------------------------------------
>
>                 Key: HDDS-1753
>                 URL: https://issues.apache.org/jira/browse/HDDS-1753
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.4.0
>            Reporter: Mukul Kumar Singh
>            Assignee: Shashikant Banerjee
>            Priority: Major
>              Labels: MiniOzoneChaosCluster
>
> Leader datanode is unable to read chunk from the datanode while replicating 
> data from leader to follower.
> Please note that deletion of keys is also happening while the data is being 
> replicated.
> {code}
> 2019-07-02 19:39:22,604 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:checkInconsistentAppendEntries(972)) - 
> 5ac88709-a3a2-4c8f-91de-5e54b617f05e: inconsistency entries. 
> Reply:76a3eb0f-d7cd-477b-8973-db1
> 014feb398<-5ac88709-a3a2-4c8f-91de-5e54b617f05e#70:FAIL,INCONSISTENCY,nextIndex:9771,term:2,followerCommit:9782
> 2019-07-02 19:39:22,605 ERROR impl.ChunkManagerImpl 
> (ChunkUtils.java:readData(161)) - Unable to find the chunk file. chunk info : 
> ChunkInfo{chunkName='76ec669ae2cb6e10dd9f08c0789c5fdf_stream_a2850dce-def3
> -4d64-93d8-fa2ebafee933_chunk_1, offset=0, len=2048}
> 2019-07-02 19:39:22,605 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:checkInconsistentAppendEntries(990)) - 
> 5ac88709-a3a2-4c8f-91de-5e54b617f05e: Failed appendEntries as latest snapshot 
> (9770) already h
> as the append entries (first index: 1)
> 2019-07-02 19:39:22,605 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:checkInconsistentAppendEntries(972)) - 
> 5ac88709-a3a2-4c8f-91de-5e54b617f05e: inconsistency entries. 
> Reply:76a3eb0f-d7cd-477b-8973-db1
> 014feb398<-5ac88709-a3a2-4c8f-91de-5e54b617f05e#71:FAIL,INCONSISTENCY,nextIndex:9771,term:2,followerCommit:9782
> 2019-07-02 19:39:22,605 INFO  keyvalue.KeyValueHandler 
> (ContainerUtils.java:logAndReturnError(146)) - Operation: ReadChunk : Trace 
> ID: 4216d461a4679e17:4216d461a4679e17:0:0 : Message: Unable to find the c
> hunk file. chunk info 
> ChunkInfo{chunkName='76ec669ae2cb6e10dd9f08c0789c5fdf_stream_a2850dce-def3-4d64-93d8-fa2ebafee933_chunk_1,
>  offset=0, len=2048} : Result: UNABLE_TO_FIND_CHUNK
> 2019-07-02 19:39:22,605 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:checkInconsistentAppendEntries(990)) - 
> 5ac88709-a3a2-4c8f-91de-5e54b617f05e: Failed appendEntries as latest snapshot 
> (9770) already h
> as the append entries (first index: 2)
> 2019-07-02 19:39:22,606 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:checkInconsistentAppendEntries(972)) - 
> 5ac88709-a3a2-4c8f-91de-5e54b617f05e: inconsistency entries. 
> Reply:76a3eb0f-d7cd-477b-8973-db1
> 014feb398<-5ac88709-a3a2-4c8f-91de-5e54b617f05e#72:FAIL,INCONSISTENCY,nextIndex:9771,term:2,followerCommit:9782
> 19:39:22.606 [pool-195-thread-19] ERROR DNAudit - user=null | ip=null | 
> op=READ_CHUNK {blockData=conID: 3 locID: 102372189549953034 bcsId: 0} | 
> ret=FAILURE
> java.lang.Exception: Unable to find the chunk file. chunk info 
> ChunkInfo{chunkName='76ec669ae2cb6e10dd9f08c0789c5fdf_stream_a2850dce-def3-4d64-93d8-fa2ebafee933_chunk_1,
>  offset=0, len=2048}
>         at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:320)
>  ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:148)
>  ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:346)
>  ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.readStateMachineData(ContainerStateMachine.java:476)
>  ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$getCachedStateMachineData$2(ContainerStateMachine.java:495)
>  ~[hadoop-hdds-container-service-0.5.0-SN
> APSHOT.jar:?]
>         at 
> com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767)
>  ~[guava-11.0.2.jar:?]
>         at 
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
>  ~[guava-11.0.2.jar:?]
>         at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) 
> ~[guava-11.0.2.jar:?]
>         at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
>  ~[guava-11.0.2.jar:?]
>         at 
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) 
> ~[guava-11.0.2.jar:?]
>         at com.google.common.cache.LocalCache.get(LocalCache.java:3965) 
> ~[guava-11.0.2.jar:?]
>         at 
> com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4764) 
> ~[guava-11.0.2.jar:?]
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.getCachedStateMachineData(ContainerStateMachine.java:494)
>  ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.ja
> r:?]
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$readStateMachineData$4(ContainerStateMachine.java:542)
>  ~[hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar:?]
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>  [?:1.8.0_171]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_171]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_171]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to