[ https://issues.apache.org/jira/browse/HDDS-709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682092#comment-16682092 ]
Shashikant Banerjee edited comment on HDDS-709 at 11/10/18 12:06 AM:
---------------------------------------------------------------------

Thanks [~jnp], for the comments.

{noformat}
In checkIfContainerNotOpenException, why do we need to dig through exceptions? Is it possible to communicate back via protocol?
{noformat}

There are two ways for the client to receive an exception. One is for the datanode to embed the error code in the ContainerCommandResponse and pass it back in the RaftClientReply message. The other is to set the exception inside the RaftClientReply, which is converted to a StateMachineException and then to a CompletionException inside Ratis. In this case the operation fails in the startTransaction phase itself, so the only way to propagate the error to the client is to set the exception in the TransactionContext, which wraps it inside a StateMachineException, citing it as a failure in protocol, and sets it inside the RaftClientReply. There is no ContainerCommandResponse in such a case, as the command never gets executed when it fails in startTransaction. We need to handle the exception on the client and hence have to dig through the wrapped exceptions.

{noformat}
if (containerState == State.OPEN || containerState == State.CLOSING)
Ideally we should not need this check to mark container UNHEALTHY. For a CLOSED container, it should not even come to this code path.
{noformat}

This check is there to mark the container unhealthy in case an applyTransaction fails while executing inside the Datanode, as per the discussion in HDDS-579. For marking a closed container unhealthy, either the client should detect corrupted blocks and tell SCM to move the container to unhealthy, or the datanode itself should discover disk failures and mark the container replicas on those disks unhealthy. These cases are not covered in the scope of this Jira.

The rest of the review comments are addressed in the v5 patch.

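To illustrate why the client has to dig through wrapped exceptions, here is a minimal, self-contained sketch of walking a cause chain. It is not the code from the patch: the class UnwrapSketch, the method isCausedByContainerNotOpen, and the two nested exception classes are hypothetical stand-ins for the Ratis and Ozone types named above, and only java.util.concurrent.CompletionException is the real JDK class.

```java
import java.util.concurrent.CompletionException;

public class UnwrapSketch {
  // Stand-ins for illustration only (assumption): the real
  // StateMachineException and ContainerNotOpenException live in Ratis/Ozone.
  static class StateMachineException extends Exception {
    StateMachineException(Throwable cause) { super(cause); }
  }

  static class ContainerNotOpenException extends Exception {
    ContainerNotOpenException(String msg) { super(msg); }
  }

  // Walks the cause chain looking for the stand-in ContainerNotOpenException.
  // The depth limit guards against an unexpectedly long or cyclic chain.
  static boolean isCausedByContainerNotOpen(Throwable t) {
    int depth = 0;
    while (t != null && depth < 16) {
      if (t instanceof ContainerNotOpenException) {
        return true;
      }
      t = t.getCause();
      depth++;
    }
    return false;
  }

  public static void main(String[] args) {
    // Mimics the wrapping order described in the comment:
    // CompletionException -> StateMachineException -> original failure.
    Throwable wrapped = new CompletionException(
        new StateMachineException(
            new ContainerNotOpenException("container is not open")));
    System.out.println(isCausedByContainerNotOpen(wrapped));
    System.out.println(
        isCausedByContainerNotOpen(new CompletionException(new RuntimeException("other"))));
  }
}
```

With the stand-in types, the first call reports a match because the original failure sits two levels below the CompletionException, and the second does not because no such cause is present.
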
> Modify Close Container handling sequence on datanodes
> -----------------------------------------------------
>
>                 Key: HDDS-709
>                 URL: https://issues.apache.org/jira/browse/HDDS-709
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Major
>         Attachments: HDDS-709.000.patch, HDDS-709.001.patch, HDDS-709.002.patch, HDDS-709.003.patch, HDDS-709.004.patch, HDDS-709.005.patch
>
>
> With the quasi closed container state for handling majority node failures, the
> close container handling sequence in Datanodes needs to change. Once the
> datanodes receive a close container command from SCM, the open container
> replicas will individually be marked as being in the closing state. In the
> closing state, only the transactions coming from the Ratis leader are allowed;
> all other write transactions will fail. A close container transaction will be
> queued via Ratis on the leader and replayed to the followers, which makes the
> replicas transition to the CLOSED/QUASI CLOSED state.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)