[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962159#comment-16962159 ]
Marton Elek commented on HDDS-2372:
-----------------------------------

I tried to reproduce it locally with docker-compose. In ContainerStateMachine I reduced the capacity of the cache:

{code:java}
stateMachineDataCache = CacheBuilder.newBuilder()
    .expireAfterAccess(500, TimeUnit.MILLISECONDS)
    // set the limit on no of cached entries equal to no of max threads
    // executing writeStateMachineData
    .maximumSize(10).build();
{code}

And added a random wait to readStateMachineData:

{code:java}
private ByteString readStateMachineData(
    ContainerCommandRequestProto requestProto, long term, long index)
    throws IOException {
  if (Math.random() > 0.7) {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
{code}

I got a similar, but different error:

{code:java}
-SegmentedRaftLogWorker: created new log segment /data/metadata/ratis/68c226d2-356c-4eb0-aee2-ce458d4b0095/current/log_inprogress_6872
datanode_3 | 2019-10-29 15:54:10,084 [pool-7-thread-38] ERROR - Unable to find the chunk file. chunk info : ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3 | 2019-10-29 15:54:10,085 [pool-7-thread-38] INFO - Operation: ReadChunk : Trace ID: b93bcdcdd7fd37c:a3bed642046e9e09:b93bcdcdd7fd37c:1 : Message: Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} : Result: UNABLE_TO_FIND_CHUNK
datanode_3 | 2019-10-29 15:54:10,085 [pool-7-thread-38] ERROR - gid group-CE458D4B0095 : ReadStateMachine failed. cmd ReadChunk logIndex 8773 msg : Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} Container Result: UNABLE_TO_FIND_CHUNK
datanode_3 | 2019-10-29 15:54:10,086 ERROR raftlog.RaftLog: 06f4231d-30a8-42fd-839e-aeaea7b1aa72@group-CE458D4B0095-SegmentedRaftLog: Failed readStateMachineData for (t:2, i:8773), STATEMACHINELOGENTRY, client-BCA58E609475, cid=4367
datanode_3 | java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3 | 	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
datanode_3 | 	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
datanode_3 | 	at org.apache.ratis.server.raftlog.RaftLog$EntryWithData.getEntry(RaftLog.java:472)
datanode_3 | 	at org.apache.ratis.util.DataQueue.pollList(DataQueue.java:134)
datanode_3 | 	at org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:220)
datanode_3 | 	at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:178)
datanode_3 | 	at org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:121)
datanode_3 | 	at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:76)
datanode_3 | 	at java.base/java.lang.Thread.run(Thread.java:834)
{code}

And the cluster is stuck in a bad state (couldn't write any more chunks, ever):

{code:java}
datanode_1 | 2019-10-29 15:54:10,099 INFO impl.RaftServerImpl: 6b9ca1af-467f-40c7-a21d-118cb34080b1@group-CE458D4B0095: inconsistency entries. Reply:06f4231d-30a8-42fd-839e-aeaea7b1aa72<-6b9ca1af-467f-40c7-a21d-118cb34080b1#0:FAIL,INCONSISTENCY,nextIndex:8773,term:2,followerCommit:8768
{code}

Correct me if I am wrong, but:
 * I think the write path should work even if the cache is limited or there are unexpected sleeps
 * If there are any inconsistencies, the raft ring should be healed or closed and reopened (but that's an independent issue)

> Datanode pipeline is failing with NoSuchFileException
> -----------------------------------------------------
>
>                 Key: HDDS-2372
>                 URL: https://issues.apache.org/jira/browse/HDDS-2372
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Critical
>
> Found it on a k8s based test cluster using a simple 3 node cluster and the
> HDDS-2327 freon test. After a while the StateMachine became unhealthy after
> this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException:
> java.util.concurrent.ExecutionException:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> java.nio.file.NoSuchFileException:
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
> {code}
> Can be reproduced.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
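To illustrate the first point above (the read path should survive cache eviction by falling back to the chunk file on disk instead of failing the Ratis log read), here is a minimal read-through sketch. All class and method names below are hypothetical, chosen for illustration only; this is not the Ozone/Ratis API, just a bounded LRU cache whose misses are served by a reload instead of an error:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded, access-ordered cache for state-machine
// data that falls back to a (simulated) disk read on a miss, instead of
// failing the read with UNABLE_TO_FIND_CHUNK after eviction.
public class ReadThroughCacheSketch {

    // Bounded LRU cache, mirroring maximumSize(10) from the repro.
    // accessOrder=true makes iteration order least-recently-used first.
    private final Map<Long, String> cache =
        new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > 10; // evict once we exceed capacity
            }
        };

    // Stand-in for re-reading the committed chunk data from disk.
    private String readChunkFromDisk(long logIndex) {
        return "chunk-data-" + logIndex;
    }

    // Read-through: serve from cache if present, otherwise reload from
    // disk and repopulate the cache. A miss is never a hard failure.
    public synchronized String readStateMachineData(long logIndex) {
        return cache.computeIfAbsent(logIndex, this::readChunkFromDisk);
    }

    public static void main(String[] args) {
        ReadThroughCacheSketch sm = new ReadThroughCacheSketch();
        // Write 20 entries; with capacity 10 the early ones get evicted.
        for (long i = 0; i < 20; i++) {
            sm.readStateMachineData(i);
        }
        // Index 0 was evicted, but the read still succeeds via the
        // disk fallback rather than failing the Ratis log read.
        System.out.println(sm.readStateMachineData(0)); // chunk-data-0
    }
}
```

With Guava's cache (as used by stateMachineDataCache), the equivalent shape would be a CacheLoader passed to CacheBuilder.build(), so that get() reloads evicted entries instead of the caller treating a miss as a missing chunk.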