[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962159#comment-16962159 ]

Marton Elek commented on HDDS-2372:
-----------------------------------

I tried to reproduce it locally with docker-compose. In the container state
machine I reduced the capacity of the cache:
{code:java}
stateMachineDataCache = CacheBuilder.newBuilder()
    .expireAfterAccess(500, TimeUnit.MILLISECONDS)
    // normally the limit on the number of cached entries equals the max
    // number of threads executing writeStateMachineData; reduced to 10
    // here to force early evictions
    .maximumSize(10).build();
{code}
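For context: with this configuration an entry disappears after 500 ms of inactivity, or once it falls outside the 10 most recent writes, so readStateMachineData is forced to fall back to the on-disk chunk file. A minimal standalone illustration of that eviction behaviour (toy code, not Ozone's):
{code:java}
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class EvictionDemo {
  public static void main(String[] args) throws InterruptedException {
    Cache<Long, String> cache = CacheBuilder.newBuilder()
        .expireAfterAccess(500, TimeUnit.MILLISECONDS)
        .maximumSize(10)
        .build();

    cache.put(1L, "chunk-1");
    System.out.println(cache.getIfPresent(1L)); // chunk-1

    TimeUnit.MILLISECONDS.sleep(600);           // exceed the idle timeout
    System.out.println(cache.getIfPresent(1L)); // null: the entry expired
  }
}
{code}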
And I added a random wait to readStateMachineData:
{code:java}
private ByteString readStateMachineData(
    ContainerCommandRequestProto requestProto, long term, long index)
    throws IOException {
  // injected delay: roughly 30% of the reads are held up for 100 ms
  if (Math.random() > 0.7) {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
  // ... rest of the original method unchanged
{code}
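The sleep is meant to widen the window between an entry being evicted from the cache and the corresponding read. A self-contained toy sketch of that race (hypothetical, not the actual Ozone code path):
{code:java}
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class DelayedReadRace {
  public static void main(String[] args) throws InterruptedException {
    // Single segment so the size-based LRU eviction is deterministic.
    Cache<Long, String> cache = CacheBuilder.newBuilder()
        .concurrencyLevel(1)
        .maximumSize(1)
        .build();
    cache.put(1L, "chunk-data"); // writeStateMachineData caches the entry

    // Reader thread: stands in for a readStateMachineData call that is
    // delayed by the injected sleep.
    Thread reader = new Thread(() -> {
      try {
        TimeUnit.MILLISECONDS.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      // By now the writer below has pushed entry 1 out of the cache.
      System.out.println("cached data: " + cache.getIfPresent(1L)); // null
    });
    reader.start();

    cache.put(2L, "newer-chunk"); // evicts entry 1 while the reader sleeps
    reader.join();
  }
}
{code}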
I got a similar, but different, error:
{code:java}
-SegmentedRaftLogWorker: created new log segment /data/metadata/ratis/68c226d2-356c-4eb0-aee2-ce458d4b0095/current/log_inprogress_6872
datanode_3    | 2019-10-29 15:54:10,084 [pool-7-thread-38] ERROR - Unable to find the chunk file. chunk info : ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3    | 2019-10-29 15:54:10,085 [pool-7-thread-38] INFO - Operation: ReadChunk : Trace ID: b93bcdcdd7fd37c:a3bed642046e9e09:b93bcdcdd7fd37c:1 : Message: Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} : Result: UNABLE_TO_FIND_CHUNK
datanode_3    | 2019-10-29 15:54:10,085 [pool-7-thread-38] ERROR - gid group-CE458D4B0095 : ReadStateMachine failed. cmd ReadChunk logIndex 8773 msg : Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} Container Result: UNABLE_TO_FIND_CHUNK
datanode_3    | 2019-10-29 15:54:10,086 ERROR raftlog.RaftLog: 06f4231d-30a8-42fd-839e-aeaea7b1aa72@group-CE458D4B0095-SegmentedRaftLog: Failed readStateMachineData for (t:2, i:8773), STATEMACHINELOGENTRY, client-BCA58E609475, cid=4367
datanode_3    | java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3    |    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
datanode_3    |    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
datanode_3    |    at org.apache.ratis.server.raftlog.RaftLog$EntryWithData.getEntry(RaftLog.java:472)
datanode_3    |    at org.apache.ratis.util.DataQueue.pollList(DataQueue.java:134)
datanode_3    |    at org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:220)
datanode_3    |    at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:178)
datanode_3    |    at org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:121)
datanode_3    |    at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:76)
datanode_3    |    at java.base/java.lang.Thread.run(Thread.java:834)
{code}
And the cluster is stuck in a bad state (it couldn't write any more chunks, ever):
{code:java}
datanode_1    | 2019-10-29 15:54:10,099 INFO impl.RaftServerImpl: 6b9ca1af-467f-40c7-a21d-118cb34080b1@group-CE458D4B0095: inconsistency entries. Reply:06f4231d-30a8-42fd-839e-aeaea7b1aa72<-6b9ca1af-467f-40c7-a21d-118cb34080b1#0:FAIL,INCONSISTENCY,nextIndex:8773,term:2,followerCommit:8768
{code}
Correct me if I am wrong, but
 * I think the write path should work even if the cache is limited or there are
unexpected sleeps (a defensive fallback is sketched after this list)
 * If there are inconsistencies, the raft ring should be healed, or closed and
reopened (but that's an independent issue)
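For the first point, one possible direction, sketched below with hypothetical names (readWithRetry is not the actual ChunkManager API): instead of failing the Raft log read on a cache miss, the disk fallback could tolerate the short window in which the chunk file is still being renamed from its .tmp name.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the actual Ozone ChunkManager API.
public final class ResilientChunkRead {

  static byte[] readWithRetry(Path chunkFile, int attempts, long backoffMs)
      throws IOException, InterruptedException {
    for (int i = 0; i < attempts; i++) {
      if (Files.exists(chunkFile)) {
        return Files.readAllBytes(chunkFile);
      }
      // The writer may still be renaming the .tmp file into place;
      // back off briefly instead of failing the whole Raft log read.
      TimeUnit.MILLISECONDS.sleep(backoffMs);
    }
    throw new IOException(
        "Chunk file not found after " + attempts + " attempts: " + chunkFile);
  }
}
{code}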

> Datanode pipeline is failing with NoSuchFileException
> -----------------------------------------------------
>
>                 Key: HDDS-2372
>                 URL: https://issues.apache.org/jira/browse/HDDS-2372
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Critical
>
> Found it on a k8s based test environment using a simple 3 node cluster and 
> the HDDS-2327 freon test. After a while the StateMachine becomes unhealthy 
> after this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  java.nio.file.NoSuchFileException: 
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
>  {code}
> Can be reproduced.


