[ https://issues.apache.org/jira/browse/IGNITE-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin updated IGNITE-17263:
-------------------------------------
    Fix Version/s: 3.0.0-beta1

> Implement leader to replica safe time propagation
> -------------------------------------------------
>
>                 Key: IGNITE-17263
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17263
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3, transaction3_ro
>             Fix For: 3.0.0-beta1
>
>         Attachments: Screenshot from 2022-07-06 16-48-30.png, Screenshot from 2022-07-06 16-48-41.png
>
>
> In order to perform replica reads, it's required either to use read index or
> to check the safe time. Let's recall the corresponding section from the tx
> design document.
>
> RO transactions can be executed on non-primary replicas. Write intent
> resolution doesn’t help because a write intent for a committed transaction
> may not yet be replicated to the replica. To mitigate this issue, it’s enough
> to run readIndex on each mapped partition leader, fetch the commit index and
> wait on a replica until it’s applied. This will guarantee that all required
> write intents are replicated and present locally. After that the normal write
> intent resolution should do the job.
>
> There is a second option, which doesn’t require the network RTT. We can use a
> special low watermark timestamp (safeTs) per replication group, which
> corresponds to the apply index of a replicated entry, so when the apply index
> is advanced during replication, the safeTs is monotonically incremented too.
> The HLC value used to advance safeTs is assigned to each replicated entry in
> an ordered way.
>
> Special measures are needed to periodically advance the safeTs if no updates
> are happening. It’s enough to use a special replication command for this
> purpose.
>
> All we need during an RO txn is to wait until the safeTs advances past the RO
> txn readTs.
>
> !Screenshot from 2022-07-06 16-48-30.png!
> In the picture we have two concurrent transactions mapped to the same
> partition: T1 and T2.
> OpReq(w1(x)) and OpReq(w2(x)) are received concurrently. Each write intent is
> assigned a timestamp in a monotonic order consistent with the replication
> order. This can be done, for example, when replication entries are dequeued
> for processing by the replication protocol (we assume entries are replicated
> successively).
>
> Waiting for safeTs alone is not enough: it may never advance if there is no
> activity in the partition. Consider the next diagram:
>
> !Screenshot from 2022-07-06 16-48-41.png!
>
> We need an additional safeTsSync command to propagate a safeTs event in case
> there are no updates in the partition.
>
> We need to linearize safe time updates in all cases, including leader
> change. So we need a guarantee that the safe time on non-primary replicas is
> never greater than the HLC on the leader (we assume that the primary replica
> is colocated with the leader). We are going to solve this problem by
> associating every potential value of safeTime (propagated to the replica from
> the leader via appendEntries) with some log index. This value (the safe time
> candidate) should be applied as the new safe time value at the moment when
> the corresponding index is committed.
>
> Hence, the safeTimeSyncCommand should also be a Raft write command.
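
The candidate mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ignite's implementation: the class SafeTimeTracker and its methods onAppend, onCommit, safeTime and canRead are hypothetical names, and plain long values stand in for HLC timestamps. Each timestamp received with an entry is stored as a candidate under its log index and becomes the safe time only once that index is committed.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of replica-side safe time tracking; not an Ignite class. */
final class SafeTimeTracker {
    /** Safe time candidates, keyed by the log index of the entry that carried them. */
    private final TreeMap<Long, Long> candidates = new TreeMap<>();

    /** Current safe time; it only moves forward. Plain long stands in for an HLC value. */
    private long safeTime;

    /** Record the leader's timestamp for an appended entry (a write or a safe time sync). */
    synchronized void onAppend(long logIndex, long leaderTs) {
        candidates.put(logIndex, leaderTs);
    }

    /** Apply every candidate whose log index is now committed, then forget it. */
    synchronized void onCommit(long commitIndex) {
        Map<Long, Long> committed = candidates.headMap(commitIndex, true);
        for (long ts : committed.values()) {
            safeTime = Math.max(safeTime, ts);
        }
        committed.clear(); // clearing the view removes the entries from the backing map
    }

    synchronized long safeTime() {
        return safeTime;
    }

    /**
     * An RO read at readTs may proceed once safe time has advanced past readTs.
     * The strict comparison follows the wording of the issue; whether a real
     * implementation uses a strict or non-strict check is an assumption here.
     */
    synchronized boolean canRead(long readTs) {
        return safeTime > readTs;
    }
}
```

In this sketch a candidate that was appended but never committed (for example, one dropped after a leader change) never reaches safeTime, which is the reason the issue ties each candidate to a log index.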