[jira] [Comment Edited] (HDDS-709) Modify Close Container handling sequence on datanodes

Shashikant Banerjee (JIRA) Fri, 09 Nov 2018 16:07:17 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682092#comment-16682092
 ]


Shashikant Banerjee edited comment on HDDS-709 at 11/10/18 12:06 AM:
---------------------------------------------------------------------

Thanks [~jnp] for the comments.
{noformat}
In checkIfContainerNotOpenException, why do we need to dig through exceptions? 
Is it possible to communicate back via protocol?{noformat}
There are two ways for the client to receive an error. One is to embed the 
error code set on the datanode in the ContainerCommandResponse and pass it back 
in the RaftClientReply message. The other is to set the exception inside the 
RaftClientReply, where it is converted to a StateMachineException and then to a 
CompletionException inside Ratis.

In this case, the operation fails in the startTransaction phase itself, so the 
only way to propagate the error to the client is to set the exception in the 
TransactionContext. The exception is then wrapped in a StateMachineException, 
which reports it as a protocol failure, and set inside the RaftClientReply. 
There is no ContainerCommandResponse in this case, because the command is never 
executed in startTransaction.

We need to handle the exception on the client and hence have to dig through the 
wrapped exceptions.
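
As a rough illustration of what digging through the wrapped exceptions can 
look like, here is a minimal sketch. The two exception classes are hypothetical 
stand-ins defined locally so the example compiles on its own; they are not the 
actual Ozone or Ratis types, and the depth limit of 32 is an arbitrary 
assumption. Only the method name checkIfContainerNotOpenException is taken from 
the review comment.

```java
import java.util.concurrent.CompletionException;

/** Sketch: find a specific cause inside a chain of wrapped exceptions. */
final class WrappedExceptionDemo {

  // Hypothetical stand-in for the exception raised when a container is not open.
  static class ContainerNotOpenException extends Exception {
    ContainerNotOpenException(String msg) { super(msg); }
  }

  // Hypothetical stand-in for the wrapper added around state machine failures.
  static class StateMachineException extends Exception {
    StateMachineException(Throwable cause) { super(cause); }
  }

  /** Walks the cause chain and reports whether any link is a ContainerNotOpenException. */
  static boolean checkIfContainerNotOpenException(Throwable t) {
    // Bound the walk so that a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
      if (t instanceof ContainerNotOpenException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Same nesting order as described above: CompletionException wraps
    // StateMachineException, which wraps the original exception.
    Throwable wrapped = new CompletionException(
        new StateMachineException(
            new ContainerNotOpenException("container is not open")));
    System.out.println(checkIfContainerNotOpenException(wrapped));

    // An unrelated failure has no matching cause anywhere in its chain.
    Throwable unrelated = new CompletionException(new RuntimeException("timeout"));
    System.out.println(checkIfContainerNotOpenException(unrelated));
  }
}
```

Walking getCause() keeps the client independent of how many wrapper layers are 
added in between, at the cost of relying on exception types rather than an 
explicit error code in the response.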
{noformat}
if (containerState == State.OPEN || containerState == State.CLOSING) Ideally we 
should not need this check to mark container UNHEALTHY. For a CLOSED container, 
it should not even come to this code path. {noformat}
This check is there to mark the container unhealthy if an applyTransaction 
fails during execution inside the Datanode, as per the discussion in HDDS-579. 
To mark a CLOSED container unhealthy, either the client should detect corrupted 
blocks and tell SCM to move the container to unhealthy, or the datanode itself 
should discover disk failures and mark the container replicas on those disks 
unhealthy. These cases are outside the scope of this Jira.
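
The check discussed above can be sketched as follows. This is only an 
illustration under assumptions: the State enum is a hypothetical, locally 
defined subset of replica states, and stateAfterApplyFailure is an invented 
helper name, not a method from the patch.

```java
/** Sketch: decide a replica's state after an applyTransaction failure. */
final class UnhealthyMarkingDemo {

  // Hypothetical subset of container replica states, defined for this example.
  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }

  /**
   * Returns the state a replica should be in after an applyTransaction
   * failure. Only replicas that can still accept writes (OPEN or CLOSING)
   * are moved to UNHEALTHY; any other state is returned unchanged.
   */
  static State stateAfterApplyFailure(State current) {
    if (current == State.OPEN || current == State.CLOSING) {
      return State.UNHEALTHY;
    }
    return current;
  }

  public static void main(String[] args) {
    for (State s : State.values()) {
      System.out.println(s + " -> " + stateAfterApplyFailure(s));
    }
  }
}
```

In this sketch a CLOSED replica is left untouched, which matches the point that 
marking closed containers unhealthy needs a separate detection path.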

The rest of the review comments are addressed in the v5 patch.


> Modify Close Container handling sequence on datanodes
> -----------------------------------------------------
>
>                 Key: HDDS-709
>                 URL: https://issues.apache.org/jira/browse/HDDS-709
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Major
>         Attachments: HDDS-709.000.patch, HDDS-709.001.patch, 
> HDDS-709.002.patch, HDDS-709.003.patch, HDDS-709.004.patch, HDDS-709.005.patch
>
>
> With the quasi-closed container state for handling majority node failures, the 
> close container handling sequence on datanodes needs to change. Once the 
> datanodes receive a close container command from SCM, the open container 
> replicas will individually be marked as CLOSING. In the CLOSING state, 
> only transactions coming from the Ratis leader are allowed; all other 
> write transactions will fail. A close container transaction will be queued via 
> Ratis on the leader and replayed to the followers, which makes each replica 
> transition to the CLOSED/QUASI CLOSED state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
