[ https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113488#comment-17113488 ]
Bharat Viswanadham edited comment on HDDS-3354 at 5/21/20, 8:55 PM:
--------------------------------------------------------------------

{quote}That's an interesting question. If I understood well, your objections is that if we do ratis log snapshot and rocksdb snapshot (=checkpoint) at the same time they can be inconsistent with each other in case of any error. I don't think it's a problem. Writing ratis log snapshot can fail even now which should be handled. The only question if we can finalize both the snapshots in one step which should be possible: For example write ratis log snapshot file and rocksdb snapshot file to the same directory and move it to the final location.{quote}

*Let me share my complete thought process here.*

1. Even if we write both the Ratis log snapshot file (which holds the snapshot index) and the RocksDB checkpoint to a temporary directory, the checkpoint can succeed while the snapshot-file write fails. The current snapshot directory then still holds the old checkpoint and the old snapshot file. If OM restarts and uses the current OM DB, we cannot avoid the replay logic, so in this case every OM restart would have to start from the last checkpoint DB and snapshot file. As we agreed, this delays startup until the leader has applied all OM entries from the snapshot index up to the latest log entry, and clients get LeaderNotReadyException in the meantime. These issues do not arise with the proposed approach. Thinking about it more, there is one additional case: say we take a snapshot every 100k transactions and OM has completed only 90k transactions, so no snapshot/checkpoint has been taken yet. On restart we would then have to delete the existing DB and come up with a new one.

2. One step failing is not the only issue; it is one of several.
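The one-step finalize suggested in the quote, and the restart decision from point 1, can be sketched as follows. This is only an illustration, not OM's actual code: the `snapshot-<index>` directory layout and file names are assumptions, and a real RocksDB checkpoint is a directory of SST files rather than a single blob. The key point is that both files become visible in one atomic rename, so a crash leaves either the old snapshot set or a complete new one, never a half-written mix.

```python
import os
import tempfile

def finalize_snapshot(base_dir, snapshot_index, checkpoint_bytes):
    """Write the RocksDB checkpoint and the Ratis snapshot-index file into
    one temporary directory, then publish both with a single atomic rename."""
    final_dir = os.path.join(base_dir, "snapshot-%d" % snapshot_index)
    tmp_dir = tempfile.mkdtemp(dir=base_dir, prefix=".tmp-snapshot-")
    # These writes may fail independently; nothing is visible to readers yet.
    with open(os.path.join(tmp_dir, "checkpoint"), "wb") as f:
        f.write(checkpoint_bytes)
    with open(os.path.join(tmp_dir, "snapshot_index"), "w") as f:
        f.write(str(snapshot_index))
    # Single finalize step: os.rename is atomic on POSIX within one file
    # system, so readers see either the old snapshot or the complete new one.
    os.rename(tmp_dir, final_dir)
    return final_dir

def latest_snapshot_index(base_dir):
    """Restart decision: resume from the newest complete snapshot if any.
    -1 means no snapshot was ever finalized (the 90k-of-100k case above),
    so the existing DB would have to be discarded."""
    indexes = [int(name.split("-", 1)[1])
               for name in os.listdir(base_dir)
               if name.startswith("snapshot-")]
    return max(indexes, default=-1)
```

A usage sketch: a node that crashed between the two writes simply never renamed its temp directory, so `latest_snapshot_index` still reports the previous snapshot, which is exactly the "come up from the last checkpoint DB and snapshot file" situation described above.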
If snapshot taking is controlled by Ratis, then while a checkpoint is in progress we must not allow any transactions to be flushed to the DB, because we need to capture exactly which transaction was last applied to the DB so that on restart we know where to resume. That means every time a checkpoint happens we have to stop the double buffer, take the checkpoint, and write the snapshot file. Stopping the double buffer today means sending a signal that interrupts the flush thread; with this change we would additionally have to track the unflushed transactions the flush has not yet completed, or wait for the flush to finish. Meanwhile applyTransaction keeps applying transactions to the StateMachine, so the double buffer's queue length can grow. This is more complex than what is proposed, and it carries its own disadvantages: startup slowness and a larger double-buffer queue.

Another approach is, instead of putting the transaction info into the DB, to repeat the checkpoint-and-snapshot-to-file process on every flush iteration, so that the double buffer is never stopped and applyTransaction keeps feeding it. But this is not a great solution either: it slows the double buffer down and increases the number of checkpoints (just to point that out), and it needs another background thread for cleanup. It does avoid the slow-startup problem, though.

Testing has shown that with HDDS-3474 + HDDS-3475 performance is not degraded and is on par, and with this approach we can remove the replay logic from the actual request handling. So even if we want to revisit this later, the code will be simpler, and developers implementing new non-idempotent write APIs will not need to handle the replay case.

{quote}I wouldn't like to say it's better.
But I think it's possible (How is your coffee?){quote}

I am also not ruling it out completely. The current solution solves the issue and has not shown any performance degradation, so I think it is fine to go with it for now; the other approach comes with its own set of problems and disadvantages. The goal here is to remove the replay logic from the actual requests and to make startup faster.

> OM HA replay optimization
> -------------------------
>
>                 Key: HDDS-3354
>                 URL: https://issues.apache.org/jira/browse/HDDS-3354
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>         Attachments: OM HA Replay.pdf, Screen Shot 2020-05-20 at 1.28.48 PM.png
>
> This Jira is to improve the OM HA replay scenario.
> Attached the design document, which discusses the proposal and the issue in detail.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org