[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961938#comment-16961938
 ] 

Marton Elek commented on HDDS-2372:
-----------------------------------

Let's say I am writing chunks. Imagine the following timing.

Flow A
 # The Leader receives the WriteChunk request
 # The chunk is written to the disk (WRITE_DATA) and saved to the cache
 # The WriteChunk is sent to Follower1 with the next HB
 # As the WriteChunk has been added to both Follower1 and the Leader, it can be 
committed
 # WriteChunk is called again (COMMIT_DATA) and the tmp file is renamed to the 
final name

Flow B
 # A HB should be sent to Follower2
 # For some reason the cache is empty (too many other requests?), so the write 
chunk has to be read from the disk
 # A new ReadChunk request is executed by the HddsDispatcher and the chunk data 
is read (from another thread, it's *async*)
 # The read HB is sent to the leader

As B.3 is an async operation, it's possible that the write chunk is committed 
(A.5) while B.3 is still running, and the chunk can no longer be read from the 
tmp file.
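
The ordering problem can be sketched in isolation with plain java.nio calls. This is only an illustration under assumptions, not the Ozone code: the class name ChunkRaceSketch, its methods, and the file names are hypothetical, and the two flows are run sequentially (A.2, A.5, then B.3) so the failure is deterministic instead of timing-dependent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical stand-in for the chunk write path; not the real Ozone classes.
public class ChunkRaceSketch {

    // A.2 (WRITE_DATA): the chunk bytes land in a temporary file.
    public static void writeData(Path tmp, byte[] data) throws IOException {
        Files.write(tmp, data);
    }

    // A.5 (COMMIT_DATA): the temporary file is renamed to its final name.
    public static void commitData(Path tmp, Path fin) throws IOException {
        Files.move(tmp, fin);
    }

    // B.3: a read that still targets the temporary file name.
    // Returns true if the read fails because the tmp file is gone.
    public static boolean readFromTmpFails(Path tmp) throws IOException {
        try {
            Files.readAllBytes(tmp);
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chunks");
        Path tmp = dir.resolve("chunk_1.tmp");
        Path fin = dir.resolve("chunk_1");

        writeData(tmp, "payload".getBytes(StandardCharsets.UTF_8)); // A.2
        commitData(tmp, fin);                                       // A.5
        // B.3 arrives late and still looks for the tmp name.
        System.out.println("read from tmp failed: " + readFromTmpFails(tmp));
        System.out.println("final file exists: " + Files.exists(fin));
    }
}
```

In the real datanode the read runs on another thread, so the same outcome would depend on whether the rename wins the race.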

> Datanode pipeline is failing with NoSuchFileException
> -----------------------------------------------------
>
>                 Key: HDDS-2372
>                 URL: https://issues.apache.org/jira/browse/HDDS-2372
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Critical
>
> Found on a k8s-based test cluster using a simple 3-node cluster and the 
> HDDS-2327 freon test. After a while the StateMachine becomes unhealthy after 
> this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  java.nio.file.NoSuchFileException: 
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
>  {code}
> Can be reproduced.


