On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
> It's becoming untenable that a plain refresh cannot be represented in
> accesslog in a way that's capable of serving a deltasync session.
> Whatever happens, we have lost a fair amount of the information needed
> to run a proper deltasync, yet if we don't want to abandon this
> functionality, we have to try and fill some of it in.

There is no record of the expectations for deltasync in a multiprovider
environment, so it is probably worth putting down what they are from my
point of view; not necessarily Howard's, who actually wrote the thing.

The intention is convergence[0] first and foremost, while sending the
changes rather than the full entries.
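
To illustrate the difference, here is a minimal sketch; the attribute
and field names are made up for illustration, not the actual accesslog
schema:

    full_entry = {          # plain syncrepl ships the whole entry...
        "dn": "uid=alice,dc=example,dc=com",
        "cn": ["Alice"],
        "mail": ["alice@example.com"],
    }

    delta = {               # ...deltasync ships only the change made
        "dn": "uid=alice,dc=example,dc=com",
        "mods": [("replace", "mail", ["alice@example.org"])],
    }

    def apply_delta(entry, mods):
        for op, attr, values in mods:
            if op == "replace":
                entry[attr] = list(values)
            elif op == "add":
                entry.setdefault(attr, []).extend(values)
            elif op == "delete":
                entry.pop(attr, None)
        return entry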

Since conflicting writes will always happen and every node acts on its
own view of the DB at the time, each might make a different decision.
In deltasync, each node will record the final modification into its
accesslog; these records might differ, and the differences cascade,
bounded only by the number of hosts involved[1]. In the end, we need
all hosts that read various versions and subsections of the log in
varying order to always converge.
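
A minimal sketch of why "greatest CSN wins" resolution is
order-independent; real conflict resolution in slapd is per-attribute
and more involved, this only illustrates the convergence argument:

    # CSNs (timestamp#count#sid#mod) compare lexicographically, so
    # plain string comparison picks the newest write. Whatever order
    # the writes arrive in, the surviving value is the same.
    writes = [
        ("20220614114056.000001Z#000000#001#000000", "from-node-1"),
        ("20220614114058.000000Z#000000#002#000000", "from-node-2"),
    ]

    def final_value(writes):
        csn, value = writes[0]
        for c, v in writes[1:]:
            if c > csn:
                csn, value = c, v
        return value

    assert final_value(writes) == final_value(list(reversed(writes)))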

A historic expectation has been that the accesslog is written in order
and relayed in the same order as written[2], implicitly assuming that
CSNs for each SID are always stored in non-descending order. This is
why some backends (e.g. back-ldif) are not suited to holding accesslog
DBs. This non-descending storage expectation might need to be
revisited at some point, hopefully not.
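
The invariant itself can be sketched as a simple scan over the log;
the CSN field layout here is illustrative:

    def sid_of(csn):
        # CSN layout: timestamp#count#sid#mod
        return csn.split("#")[2]

    def non_descending_per_sid(log_csns):
        last = {}                     # sid -> highest CSN stored so far
        for csn in log_csns:          # in the backend's natural order
            sid = sid_of(csn)
            if csn < last.get(sid, csn):
                return False          # violated, cf. ITS#9358
            last[sid] = csn
        return True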

Another expectation is that fallback behaviour be both graceful and
efficient: similarly to convergence, sessions will eventually move back
to deltasync if at an arbitrary point we were to stop introducing
*conflicting* changes into the environment. At the same time, for the
sake of convergence, we need to be tolerant of some/all links running a
plain syncrepl refresh at some points in time.
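
In other words, each session is expected to behave like a small state
machine, degrading and recovering on its own. A sketch, with
hypothetical inputs:

    DELTASYNC, PLAIN_REFRESH = "deltasync", "plain refresh"

    def next_mode(mode, log_serves_cookie, caught_up):
        if mode == DELTASYNC and not log_serves_cookie:
            return PLAIN_REFRESH   # graceful: degrade rather than diverge
        if mode == PLAIN_REFRESH and caught_up:
            return DELTASYNC       # efficient: rejoin once conflicts stop
        return mode                # otherwise stay where we are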

We have to expect to be running in a real-world environment, where
arbitrary[3] topologies might be in place and any number of links
and/or nodes can be out of commission for any amount of time. When
isolated nodes rejoin, they should be able to converge eventually,
regardless of how long the isolation/partition lasted. Still, we can't
require that the accesslog grow without bound, so we need to be able to
detect when we no longer retain the relevant data and work around
it[4].
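
Detection could look roughly like the following sketch, where
mincsn_by_sid stands in for whatever per-SID low-water mark the
provider keeps about its log; both names are hypothetical:

    def can_serve_delta(cookie_csns, mincsn_by_sid):
        # cookie_csns: the consumer's cookie, one CSN per SID
        for sid, cookie_csn in cookie_csns.items():
            oldest = mincsn_by_sid.get(sid)
            if oldest is not None and cookie_csn < oldest:
                # entries the consumer still needs are already gone;
                # fall back to a full refresh (cf. ITS#9823)
                return False
        return True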

Each node's accesslog DB should always be self-consistent: if a
read-only consumer starts with the same DB as the provider had at some
point, it shall always be able to replay the provider's accesslog
cleanly, regardless of what kinds of conflict resolution the provider
had to go through. N.B. if it is impossible to write a self-consistent
accesslog in certain situations, it is OK to pretend that certain parts
of the accesslog have already been purged, e.g. by attaching
meta-information understood by syncprov to that effect.
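
The property is testable: replaying the log over a matching snapshot
must reproduce the provider's current content. A hypothetical test
harness, reusing apply_delta from the earlier sketch:

    def replay_is_clean(snapshot, log_records, provider_state):
        # snapshot: dn -> entry, taken when the consumer was in sync
        state = {dn: dict(entry) for dn, entry in snapshot.items()}
        for rec in log_records:       # rec: {"dn": ..., "mods": [...]}
            entry = state.setdefault(rec["dn"], {})
            apply_delta(entry, rec["mods"])
        return state == provider_state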

Regardless of the promises stated above, we should also expect that
administrators deploying any multiprovider environment actively monitor
it. Just like backups, if replication is not checked routinely, it
almost always breaks when you actually need it. There are multiple
resources on how to do so and more tools can and will be developed as
the need is identified.
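
As one example of such a check, a script can compare contextCSN values
across all nodes; URIs and suffix below are placeholders, the
python-ldap module is assumed, and equal values are necessary but not
sufficient for identical content:

    import ldap

    NODES = ["ldap://node1", "ldap://node2", "ldap://node3"]
    SUFFIX = "dc=example,dc=com"

    def context_csns(uri):
        conn = ldap.initialize(uri)
        res = conn.search_s(SUFFIX, ldap.SCOPE_BASE,
                            "(objectClass=*)", ["contextCSN"])
        _dn, attrs = res[0]
        return tuple(sorted(attrs.get("contextCSN", [])))

    csns = {uri: context_csns(uri) for uri in NODES}
    if len(set(csns.values())) != 1:
        print("nodes lagging or diverged:", csns)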

There are also some non-expectations, generally shared with plain
syncrepl anyway:
- If a host/network link is underpowered for the amount of changes
  coming in, it might fall behind; this doesn't affect eventual
  convergence[0] and it is up to the administrator to size their
  environment correctly
- Any host configured to accept writes will do so, allowing conflicts
  to arise; any/all of these might be (partially) reverted in the face
  of conflicting writes elsewhere in the environment. Note that this is
  already the case with plain syncrepl
- We do not aim to minimise the number of "redundant" messages passed
  if there are multiple paths between nodes; LDAP semantics do not
  allow this to be done in a safe way with a CSN-based replication
  system

I hope I haven't missed anything important.

[0]. Let's take the usual definition of eventual convergence: if at an
     arbitrary point we were to stop introducing new changes to the
     environment and restore connectivity, all participating nodes will
     arrive at identical content in a finite number of steps (and
     there's a way to tell when that's happened)
[1]. Contrast this with "log replication" in Raft et al., where all
     members of a cluster coordinate to build a shared view of the
     actual history, not accepting a change until it has been accepted
     by a majority
[2]. If this assumption is violated, as in ITS#9358, the consumer will
     have to skip some legitimate operations and diverge
[3]. We can still assume that in the designed topology all the nodes
     that accept write operations belong to the same strongly connected
     component
[4]. This is the assumption that was at the core of the issue described
     in ITS#9823

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation                       http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP
