[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968264#comment-16968264
 ] 

Marton Elek commented on HDDS-2372:
-----------------------------------

I had a long conversation with [~shashikant], and he helped me a lot to 
understand the problem (thanks again). Here are our proposals:

 

[Problem 1]: race condition between read (read the statemachine data to send it 
to the followers) and commit

This can be solved by making a second read attempt after the exception is thrown.
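
A minimal sketch of the retry idea, for illustration only. The class and 
method names (RetryRead, readWithOneRetry) are hypothetical and are not part 
of Ozone or Ratis; the read is modelled as a plain Callable.

```java
import java.util.concurrent.Callable;

final class RetryRead {
    // Hypothetical helper: run the read once and, if it throws,
    // try exactly one more time before giving up.
    static <T> T readWithOneRetry(Callable<T> read) throws Exception {
        try {
            return read.call();
        } catch (Exception first) {
            try {
                return read.call();
            } catch (Exception second) {
                // Keep the first failure visible for debugging.
                second.addSuppressed(first);
                throw second;
            }
        }
    }
}
```

If the second attempt also fails, the exception is still propagated, so a 
real missing file is not hidden by the retry.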

[Problem 2]: race condition between writeStateMachineData and 
readStateMachineData (the statemachine data write might not be finished when we 
start to read back the data, in case of a missing cache entry).


This can be fixed by checking the size of the data and comparing it with the 
length which is part of the chunk write request.
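
A rough sketch of such a length check, assuming the chunk data is a regular 
file on disk. ChunkLengthCheck and isFullyWritten are hypothetical names, and 
expectedLength stands in for the length carried by the chunk write request.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

final class ChunkLengthCheck {
    // Returns true only if the file exists and already has the expected size.
    static boolean isFullyWritten(Path chunkFile, long expectedLength)
            throws IOException {
        try {
            return Files.size(chunkFile) == expectedLength;
        } catch (NoSuchFileException e) {
            // The write has not created the file yet.
            return false;
        }
    }
}
```

A caller could treat a false result as "write still in progress" and read 
again later instead of returning partial data.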

[Problem 3]: race condition between close container / write chunk / read chunk: 
write chunk may be declined because the container is closed. In this case the 
read chunk error should be ignored silently instead of throwing an exception 
for ratis. This can be done using the bcsid: if it's newer than the 
term/index of the close container, the request can be safely ignored.
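
One possible reading of this rule, as a sketch. It assumes both values can be 
compared as plain log indexes; the names (CloseContainerCheck, 
canIgnoreReadFailure, writeChunkIndex, closeContainerIndex) are hypothetical 
and how the real values are obtained is not shown.

```java
final class CloseContainerCheck {
    // Assumption: a write chunk that is newer than the close container entry
    // was declined, so a failed read of its data can be ignored.
    static boolean canIgnoreReadFailure(long writeChunkIndex,
                                        long closeContainerIndex) {
        return writeChunkIndex > closeContainerIndex;
    }
}
```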

> Datanode pipeline is failing with NoSuchFileException
> -----------------------------------------------------
>
>                 Key: HDDS-2372
>                 URL: https://issues.apache.org/jira/browse/HDDS-2372
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Assignee: Shashikant Banerjee
>            Priority: Critical
>
> Found it on a k8s-based test cluster using a simple 3-node cluster and the 
> HDDS-2327 freon test. After a while the StateMachine became unhealthy after 
> this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  java.nio.file.NoSuchFileException: 
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
>  {code}
> Can be reproduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
