[ 
https://issues.apache.org/jira/browse/IGNITE-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625230#comment-17625230
 ] 

Alexander Lapin commented on IGNITE-17263:
------------------------------------------

[~Denis Chudov] Looks good. Some follow-up tickets are required; however, the core 
implementation should definitely be merged. Thank you!

> Implement leader to replica safe time propagation
> -------------------------------------------------
>
>                 Key: IGNITE-17263
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17263
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3, transaction3_ro
>             Fix For: 3.0.0-beta1
>
>         Attachments: Screenshot from 2022-07-06 16-48-30.png, Screenshot from 
> 2022-07-06 16-48-41.png
>
>
> In order to perform replica reads, it's required either to use read index or 
> check the safe time. Let's recall the corresponding section from the tx design 
> document.
> RO transactions can be executed on non-primary replicas. Write intent 
> resolution doesn’t help because a write intent for a committed transaction 
> may not yet be replicated to the replica. To mitigate this issue, it’s enough 
> to run readIndex on each mapped partition leader, fetch the commit index and 
> wait on a replica until it’s applied. This will guarantee that all required 
> write intents are replicated and present locally. After that the normal write 
> intent resolution should do the job.
> There is a second option, which doesn’t require the network RTT. We can use a 
> special low watermark timestamp (safeTs) per replication group, which 
> corresponds to the apply index of a replicated entry, so when the apply index 
> is advanced during replication, the safeTs is monotonically 
> incremented too. The HLC used for safeTs advancing is assigned to a 
> replicated entry in an ordered way.
> Special measures are needed to periodically advance the safeTs if no updates 
> are happening. It’s enough to use a special replication command for this 
> purpose.
> All we need during RO txn is to wait until a safeTs advances past the RO txn 
> readTs. 
>  !Screenshot from 2022-07-06 16-48-30.png! 
> In the picture we have two concurrent transactions mapped to the same 
> partition: T1 and T2.
> OpReq(w1(x)) and OpReq(w2(x)) are received concurrently. Each write intent is 
> assigned a timestamp in a monotonic order consistent with the replication 
> order. This can be done, for example, when replication entries are dequeued for 
> processing by the replication protocol (we assume entries are replicated 
> successively).
> It’s not enough only to wait for safeTs: it may never advance due to the absence 
> of activity in the partition. Consider the next diagram:
>  !Screenshot from 2022-07-06 16-48-41.png! 
> We need an additional safeTsSync command to propagate a safeTs event in case 
> there are no updates in the partition.
> We need to linearize safe time updates in all cases, including leader 
> change. So we need a guarantee that the safe time on non-primary replicas will 
> never be greater than the HLC on the leader (as we assume that the primary 
> replica is colocated with the leader). We are going to solve this problem by 
> associating every potential value of safeTime (propagated to the replica from 
> the leader via appendEntries) with some log index, and this value (safe time 
> candidate) should be applied as the new safe time value at the moment when the 
> corresponding index is committed.
> Hence, the safeTimeSyncCommand also should be a Raft write command.
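
The replica-side bookkeeping described in the ticket can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code merged for IGNITE-17263: the class SafeTimeTracker and its methods onAppend, onCommit, onTruncate and waitFor are hypothetical names, timestamps are modelled as plain long values instead of an HLC, and "advances past readTs" is treated here as safeTs >= readTs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of safe time tracking on a replica; not Ignite's actual API. */
final class SafeTimeTracker {
    /** Safe time candidates keyed by the log index they arrived with. */
    private final TreeMap<Long, Long> candidates = new TreeMap<>();

    /** RO readers keyed by the read timestamp they are waiting for. */
    private final TreeMap<Long, List<CompletableFuture<Void>>> waiters = new TreeMap<>();

    /** Current safe time; only ever moves forward. */
    private long safeTime;

    /** Called when an entry (an update or a safe time sync command) is appended to the log. */
    synchronized void onAppend(long logIndex, long candidateTs) {
        candidates.put(logIndex, candidateTs);
    }

    /** Called when the commit index advances: candidates up to it become the safe time. */
    synchronized void onCommit(long committedIndex) {
        NavigableMap<Long, Long> committed = candidates.headMap(committedIndex, true);
        for (long ts : committed.values()) {
            safeTime = Math.max(safeTime, ts);
        }
        committed.clear(); // the view is backed by the map, so this removes applied candidates

        // Wake up readers whose read timestamp is now covered (assumption: safeTs >= readTs).
        NavigableMap<Long, List<CompletableFuture<Void>>> ready = waiters.headMap(safeTime, true);
        ready.values().forEach(list -> list.forEach(f -> f.complete(null)));
        ready.clear();
    }

    /** Called when uncommitted entries are discarded, e.g. after a leader change. */
    synchronized void onTruncate(long fromIndex) {
        candidates.tailMap(fromIndex, true).clear();
    }

    /** Returns a future that completes once the safe time has reached readTs. */
    synchronized CompletableFuture<Void> waitFor(long readTs) {
        if (readTs <= safeTime) {
            return CompletableFuture.completedFuture(null);
        }
        CompletableFuture<Void> f = new CompletableFuture<>();
        waiters.computeIfAbsent(readTs, k -> new ArrayList<>()).add(f);
        return f;
    }

    synchronized long safeTime() {
        return safeTime;
    }
}
```

Because a candidate only becomes the safe time when its log index is committed, a candidate attached to an entry that is later discarded never becomes visible to readers. In this sketch an idle partition would leave waitFor pending until some entry, such as the safe time sync command, is appended and committed.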



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
