[ 
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113488#comment-17113488
 ] 

Bharat Viswanadham edited comment on HDDS-3354 at 5/21/20, 8:55 PM:
--------------------------------------------------------------------

{quote}That's an interesting question. If I understood well, your objections is 
that if we do ratis log snapshot and rocksdb snapshot (=checkpoint) at the same 
time they can be inconsistent with each other in case of any error.

I don't think it's a problem. Writing ratis log snapshot can fail even now 
which should be handled. The only question if we can finalize both the 
snapshots in one step which should be possible: For example write ratis log 
snapshot file and rocksdb snapshot file to the same directory and move it to 
the final location.{quote}
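For reference, the "write both files to the same directory and move it to the final
location" step from the quote could be sketched roughly as below. This is only a
minimal sketch; the directory layout, file names, and class are made up for
illustration and are not the actual OM code.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.rocksdb.Checkpoint;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class SnapshotFinalizerSketch {

  // Prepare the RocksDB checkpoint and the ratis snapshot-index file in a
  // temporary directory, then publish both with a single atomic rename.
  static void takeAndFinalizeSnapshot(RocksDB omDb, long lastAppliedIndex,
      Path tmpDir, Path finalDir) throws IOException, RocksDBException {

    Files.createDirectories(tmpDir);

    // RocksDB checkpoint (hard links to SST files); target must not exist yet.
    try (Checkpoint checkpoint = Checkpoint.create(omDb)) {
      checkpoint.createCheckpoint(tmpDir.resolve("om.db.checkpoint").toString());
    }

    // Ratis log snapshot file holding the last included index.
    Files.write(tmpDir.resolve("ratis-snapshot-index"),
        Long.toString(lastAppliedIndex).getBytes(StandardCharsets.UTF_8));

    // One atomic rename publishes both files, or neither. A real implementation
    // would first move the previous finalDir aside, since a rename onto a
    // non-empty directory fails.
    Files.move(tmpDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}

Even with this one-step finalize, point 1 below is about what happens when the
prepare step itself fails part way, and what restart has to do in that case.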

*Let me share my complete thought process here.*
1. Suppose we write both the ratis log snapshot file (which holds the snapshot 
index) and the RocksDB checkpoint to a temporary directory: the checkpoint can 
succeed while writing the snapshot file fails, leaving only the old checkpoint 
and old snapshot file in the final snapshot directory. During OM restart, if we 
use the current OM DB we cannot avoid the replay logic. So in this case, on 
every OM restart we would have to come up from the last checkpoint DB and its 
snapshot file instead of the current DB. If we accept that, startup is delayed 
until the leader has applied all OM entries from the snapshot index up to the 
latest log entry, and clients will get LeaderNotReadyException in the meantime. 
These kinds of issues will not be seen with the proposed approach. Thinking 
about it more, there is one more case: say we take a snapshot every 100K 
transactions and the OM has completed only 90K, so no snapshot/checkpoint has 
been taken yet; on restart we would then have to delete the existing DB and 
come up with a new, empty DB.

2. A single step failing is not the only issue; it is one of the issues. If 
taking the snapshot is controlled by ratis, then while a checkpoint is 
happening we must not allow any transactions to be flushed to the DB, because 
the snapshot has to reflect the exact last transaction applied to the DB, so 
that on restart we know which transactions are already applied. That means 
every time a checkpoint happens we need to stop the double buffer, take the 
checkpoint, and write the snapshot file. Today, stopping the double buffer 
sends a signal that interrupts the flush thread; with this change we would 
additionally have to either keep track of the transactions the interrupted 
flush did not complete or wait for the flush to finish. Either way, the 
double-buffer queue length can grow, since applyTransaction keeps applying 
transactions to the StateMachine in the meantime. This looks more complex than 
what is proposed (a rough sketch of the proposed approach follows), and it 
comes with its own disadvantages of startup slowness and double-buffer queue 
growth.
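For contrast, the "putting transaction info" idea mentioned below, which is
roughly what the proposal does, can be sketched like this: the double buffer
adds the last applied (term, index) to the very same RocksDB write batch it is
about to commit, so the DB itself always records how far apply has progressed
and nothing has to be paused. This is only a sketch; the key name and classes
are illustrative, not the actual OM tables or classes.

{code:java}
import java.nio.charset.StandardCharsets;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class DoubleBufferFlushSketch {

  // Illustrative key; the real OM keeps transaction info in its own table.
  private static final byte[] TXN_INFO_KEY =
      "#TRANSACTIONINFO".getBytes(StandardCharsets.UTF_8);

  // Called by the double-buffer flush: the (term, index) marker rides in the
  // same batch as the applied operations, so a crash leaves the DB with the
  // operations and the marker either both applied or both missing.
  static void flushBatch(RocksDB omDb, WriteBatch batch, long term,
      long lastAppliedIndex) throws RocksDBException {
    byte[] value = (term + "#" + lastAppliedIndex)
        .getBytes(StandardCharsets.UTF_8);
    batch.put(TXN_INFO_KEY, value);
    try (WriteOptions options = new WriteOptions()) {
      omDb.write(options, batch);
    }
  }

  // On restart the last applied transaction is read straight from the DB,
  // which is what lets ratis skip re-applying (replaying) older log entries.
  static String readLastApplied(RocksDB omDb) throws RocksDBException {
    byte[] raw = omDb.get(TXN_INFO_KEY);
    return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
  }
}
{code}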

Another approach is, instead of putting transaction info into the DB, to 
repeat the above checkpoint-and-snapshot-file process on every flush 
iteration, so that we do not need to stop the double buffer and 
applyTransaction can keep feeding it (rough sketch below). But this is not a 
great solution: it makes the double buffer slower, the number of checkpoints 
increases (just want to point that out), and we need another background thread 
to clean them up. It would, however, not have the slow-startup problem.
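For completeness, that per-iteration variant would look roughly like the sketch
below: a RocksDB checkpoint (a new directory of hard-linked SST files) is
created on every flush, which is why it slows the flush path and needs a
background cleanup thread. The class and paths are hypothetical.

{code:java}
import org.rocksdb.Checkpoint;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PerFlushCheckpointSketch {

  // Hypothetical hook run after every double-buffer flush iteration.
  static void checkpointAfterFlush(RocksDB omDb, long flushedIndex,
      String checkpointRoot) throws RocksDBException {
    try (Checkpoint checkpoint = Checkpoint.create(omDb)) {
      // Target directory must not exist yet; one new directory per iteration.
      checkpoint.createCheckpoint(
          checkpointRoot + "/flush-checkpoint-" + flushedIndex);
    }
    // A separate background thread would prune all but the newest checkpoints.
  }
}
{code}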

Testing has shown that with HDDS-3474 + HDDS-3475 performance is not degraded 
and is on par, and with this we can remove the replay logic from the actual 
request logic. So even if we want to revisit later, it will be simpler, and 
developers implementing new non-idempotent write requests will not need to 
know about handling the replay case.

{quote}I wouldn't like to say it's better. But I think it's possible (How is 
your coffee?){quote}
I am also not completely ruling out that the other approach is possible. But 
since the current solution solves the issue and has not shown any performance 
degradation, I think it is fine to go with it for now; the other approach 
comes with its own set of problems and disadvantages. The goal here is to 
avoid replay logic in the actual requests and to make startup faster.





> OM HA replay optimization
> -------------------------
>
>                 Key: HDDS-3354
>                 URL: https://issues.apache.org/jira/browse/HDDS-3354
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>         Attachments: OM HA Replay.pdf, Screen Shot 2020-05-20 at 1.28.48 
> PM.png
>
>
> This Jira is to improve the OM HA replay scenario.
> Attached the design document which discusses about the proposal and issue in 
> detail.


