I want to start the discussion around what OT algorithms to use. This is going to get technical. This is not a bug. Please ask simple but tangential questions (like "how does wave's current crypto system work?") in a new thread instead of this one.

Requirements:
- Arbitrary peers can synchronize data.
- We can collaboratively edit tree structures, key-value pairs and rich text. (I'd like arbitrary JSON editing.)
- Fast enough to work on mobile devices. The protocol needs to be low bandwidth and require a low number of network round-trips.
- Ideally, a low storage overhead.

Nice-to-have:
- End-to-end encryption of content. (In the wake of PRISM, the less information my wave server needs to know about me, the better.)
- Some wave version of OTR[1].
- Presence & cursors.
- Simple. At least, as simple as we can manage. Let's at least try :/

Is there anything else I'm missing? Some of this stuff is only tangentially related to OT (like cursors and encryption), but it's still important that we bring it up.

Given that we want to be able to connect arbitrary peers, I think it makes sense to use a message encryption & signing system. Currently, wave servers sign their users' messages. To support users collaborating over a LAN without servers, we should allow users to sign (& encrypt) their own messages. Unfortunately, the signature becomes invalid if the operation is transformed. Consider peers A and B (communicating over a LAN), and a disconnected peer C. A and B each generate simultaneous operations and sign them. A sends its operation to B, who transforms it. B then connects to C and sends both operations. For C to verify A's operation, C must see the original (untransformed) version of the operation. I think the only way we can make this work securely is by only ever sending original operations, never operations after they've been transformed.
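To make that concrete, here's a rough sketch in JavaScript. The op format, the Ed25519 keys and the transform function are all placeholders I made up, not a proposal - the point is only that IDs and signatures are computed over the original operation bytes, and transformation only ever happens on a local copy after verification:

    const crypto = require('crypto');

    // The wire format only ever carries the *original* operation bytes.
    // (Keys would come from e.g. crypto.generateKeyPairSync('ed25519').)
    function makeWireOp(opData, parents, privateKey) {
      const bytes = Buffer.from(JSON.stringify({ opData, parents }));
      return {
        id: crypto.createHash('sha256').update(bytes).digest('hex'),
        bytes,                              // exact bytes that were hashed & signed
        sig: crypto.sign(null, bytes, privateKey),
      };
    }

    // On receipt: verify against the original bytes first, then transform a
    // local copy by our concurrent ops (using whatever transform function we
    // end up with). The original stays untouched so we can forward it - and
    // its signature - to other peers later.
    function ingestWireOp(wireOp, publicKey, concurrentOps, transform) {
      if (!crypto.verify(null, wireOp.bytes, publicKey, wireOp.sig))
        throw new Error('bad signature');
      let op = JSON.parse(wireOp.bytes).opData;
      for (const other of concurrentOps) op = transform(op, other);
      return op;  // transformed copy, safe to apply to the local document
    }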
As far as I know, there are three broad directions we could take algorithmically:

1. CRDTs (like Logoot[2]).
2. 'Classical' OT using vector clocks for versioning.
3. The OT system that Torben Weis was working on. (Some might remember him from the wave summit a few years ago.) I have been chatting on and off with him about this since the summit. He hasn't published anything on it yet, though. It's similar to classical OT, except using git-style version hashes.

CRDTs are probably the most elegant solution. They avoid the entire complicated mess of transformation. However:

- They add a significant overhead to the document size (we must store a unique name between every two characters in the wave).
- As far as I know, nobody has implemented arbitrary JSON editing using CRDTs. Implementing a set is very difficult.
- Even without transform, we'll still have to store operations anyway so peers can verify signatures.

I'd like to know what we should expect the actual document size overhead to be. I expect it to be at least a 5x size increase, but I don't know for sure. The papers I've read cleverly minimize the overhead by only doing OT on a line-by-line basis, but that's unacceptable for wave.

As I understand it, vector clocks are the size of the number of unique peers which have ever connected to the network. (They contain an integer version number per peer on the wave.) A vector clock must be stored in each operation - which always seemed unpalatable enough to me that I didn't explore them further. Note that this is per *peer*, not per user. If we want full p2p, that's one entry per device which has ever edited a wave.
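For concreteness, here's roughly what that data structure looks like (a plain JavaScript sketch, names mine): a clock is just a map from peer ID to how many ops that peer has generated, and two ops are concurrent exactly when neither clock dominates the other - which is when you need to transform.

    // peerId -> number of ops that peer has generated. The map grows with
    // every device that ever edits the wave, which is the overhead above.
    function increment(clock, peerId) {
      return { ...clock, [peerId]: (clock[peerId] || 0) + 1 };
    }

    // 'a happened before b' iff every entry of a is <= the matching entry
    // of b, and at least one is strictly less.
    function happenedBefore(a, b) {
      let strictlyLess = false;
      for (const peer of new Set([...Object.keys(a), ...Object.keys(b)])) {
        const ai = a[peer] || 0, bi = b[peer] || 0;
        if (ai > bi) return false;
        if (ai < bi) strictlyLess = true;
      }
      return strictlyLess;
    }

    // If neither op's clock happened before the other, the ops are
    // concurrent and must be transformed against each other.
    function concurrent(a, b) {
      return !happenedBefore(a, b) && !happenedBefore(b, a);
    }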
There are probably some clever ways we could minimize that size. Off the top of my head:

- Most operations are generated on the same site as their single parent. These operations only increment the local version number. We could special-case this. I don't know how much better this would make the system - I suspect a lot.
- If we want, we could build a hybrid system where most clients use a client-server protocol like the current system. Servers could then use vector clocks or whatever to confer amongst themselves. This way we only need one element in the clock per server. However, we lose most of the benefits of having a p2p OT system to begin with.

Torben's system is the one I know the most about. I'm biased to think it's the right answer because I understand it better than the others. I just spent the last couple of hours writing up how it works, but I'm going to run it by Torben before posting it to this list. It works like this:

First, we need a transform function that supports TP2. The simplest way to do this is to add tombstones to the data type. (We'll need to do this for the vector clock based system as well, by the way.) I have a plaintext implementation which does this in javascript[3], although it's quite slow (~150k transforms per second). I expect that an optimized version of this will be around 10-20 million transforms/second in C[4] for simple ops, and probably a few million per second in Java.

Second, we need an inverse transform function (prune), such that P(T(a, b), b) = a.

In its current incarnation, the algorithm works as follows. Consider a DAG of all the operations. The closest metaphor is a network of commits in git. Every operation has a unique hash identifier and one or more parents (except the root, which has no parents). It'll look something like this: https://dl.dropboxusercontent.com/u/2494815/ops.png ... except it'll usually be waaaay less complicated than that. The cool thing about this structure is that if you apply all of the operations in that graph in *any order*, you'll end up in the same document state (so long as each node comes after all its parents and you use the transform function correctly).

Each peer stores a history list of operations. Each operation is stored twice. One copy is its original form (the exact bytes which were used for hashing and signing); when sending the operation to a remote peer, this is the copy we send. The second copy of the op is transformed. At all times, the transformed ops in the history list are written such that you could directly apply every operation in the history list (in order) to a new document and you would reach the current document state. The history lists on different peers don't have to be in the same order. We also store a copy of the current document snapshot, because it's really useful for the UI.

Consider the DAG again. Every operation in the history list is a node in that DAG, with its parents as other nodes. The DAG will have at least one node with no children (= the bottom nodes in that diagram I linked above). These childless nodes form the *frontier*. When a new operation is created, all ops in the frontier become the new node's parents. The new node also stores the number of ancestors it has (= the history list length). I'll get to why that's useful in a moment.
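Putting that description into code, a peer's state and the creation of a local op might look roughly like this (JavaScript; field names are mine, not a proposed format):

    const crypto = require('crypto');

    const peer = {
      history: [],    // [{id, originalBytes, transformed, parents, ancestorCount}]
      frontier: [],   // ids of the ops that currently have no children
      snapshot: null, // current document state, kept around for the UI
    };

    // A locally created op has the whole current frontier as its parents,
    // and remembers how many ancestors it has (= current history length).
    function createOp(peer, opData) {
      const originalBytes = Buffer.from(JSON.stringify({ opData, parents: peer.frontier }));
      const entry = {
        id: crypto.createHash('sha256').update(originalBytes).digest('hex'),
        originalBytes,                   // exact bytes used for hashing & signing
        transformed: opData,             // copy kept consistent with our history order
        parents: peer.frontier.slice(),
        ancestorCount: peer.history.length,
      };
      peer.history.push(entry);
      peer.frontier = [entry.id];        // the new op is now the only childless node
      return entry;
    }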
The history list has a special power because of the prune function: operations which don't depend on one another can be swapped in place:

    swap = (a, b) ->
      b' = prune(b, a)
      a' = transform(a, b)

When peers connect, they send each other missing ops. Figuring out which ops are missing can be surprisingly tricky - but we'll figure that out later. New ops must be ingested in order, so we always ingest an operation after ingesting all of its parents.

So you're given an op from another peer to ingest. We need to transform that operation by all the operations we have locally that aren't among the op's ancestors. You can't just arbitrarily transform two ops by one another - they both have to be applicable to the same document. So what we do is rewrite our local history so that all of the operation's ancestors are at the start of our history list, and all of the other operations are at the end. Once we've done that, the operation can be transformed by all the ops in the second half of our history list, applied, and appended to the end of the history list.

Remember, we know how many ancestors the new operation has - we saved that when the op was created. We also have all of that operation's parents in our history list. So we find the position of the last of the op's parents in the history list. If that position (counting from 1) equals the new op's ancestor count, we're golden.

Imagine your history list contains:

A B X Y

and the peer you're connecting to has:

A B C

(C has B as a parent). C has 2 ancestors (A and B). When you ingest C, you find B (its last parent), which is at position 2 in the history list. So you know that everything after that in the history list isn't one of C's ancestors, and you can transform C by X, Y and then apply it.

If our history list instead looks like this:

X A B Y

... then you'll find B is too late in your history list. You need to rewrite your history list to move the X after B. You use the swap function to do that, swapping X and A, then X and B. Your history then looks like A B X Y, like we wanted, so you can proceed as above.
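Here's a rough sketch of that ingestion step in JavaScript, building on the createOp sketch above. The transform/prune functions, the apply() call and the ancestor-set computation are placeholders; real code would also need signature checks and frontier bookkeeping.

    // Walk the DAG backwards from the op's parents to collect its ancestors.
    function ancestorIds(peer, op) {
      const byId = new Map(peer.history.map(e => [e.id, e]));
      const seen = new Set();
      const stack = op.parents.slice();
      while (stack.length) {
        const id = stack.pop();
        if (seen.has(id)) continue;
        seen.add(id);
        const entry = byId.get(id);
        if (entry) stack.push(...entry.parents);
      }
      return seen;
    }

    // Swap two adjacent, independent history entries, as in the swap
    // pseudocode above (assuming transform()/prune() for the TP2 type).
    // Only the transformed copies change; the signed original bytes don't.
    function swapAdjacent(history, i) {
      const a = history[i], b = history[i + 1];
      const bPruned = prune(b.transformed, a.transformed);
      const aSwapped = transform(a.transformed, b.transformed);
      history[i] = { ...b, transformed: bPruned };
      history[i + 1] = { ...a, transformed: aSwapped };
    }

    // newOp is parsed from its original bytes: {id, opData, parents, ancestorCount}
    function ingestOp(peer, newOp) {
      const ancestors = ancestorIds(peer, newOp);

      // Rewrite the history so the first ancestorCount entries are exactly
      // the op's ancestors: bubble each ancestor left past any non-ancestors
      // sitting in front of it (the X A B Y -> A B X Y rewrite above).
      let placed = 0;
      for (let i = 0; i < peer.history.length && placed < newOp.ancestorCount; i++) {
        if (!ancestors.has(peer.history[i].id)) continue;
        for (let j = i; j > placed; j--) swapAdjacent(peer.history, j - 1);
        placed++;
      }

      // Transform the new op by everything that isn't an ancestor, apply it,
      // and append it to the history.
      let op = newOp.opData;
      for (let i = newOp.ancestorCount; i < peer.history.length; i++) {
        op = transform(op, peer.history[i].transformed);
      }
      peer.snapshot = apply(peer.snapshot, op);   // hypothetical apply()
      peer.history.push({ ...newOp, transformed: op });
    }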
This algorithm requires a lot of transforms, but I think most of them are unavoidable because of the signing / trust issue. It also requires that every operation (i.e., every key press) has a unique ID. That's probably going to be a 16-byte hash or something, which seems like a lot of extra bytes for typing one character.

I implemented it here: https://github.com/josephg/tp2stuff/blob/master/node2.coffee ... though the randomizer setup I have (a handful of peers, all randomly generating operations and periodically, randomly, connecting to each other to sync up) needs a lot of transforms. 5 clients generating operations, with two random clients syncing about every 5 operations, results in over 1000x as many transforms as operations (10k ops -> 11M transforms). Doing the same thing with 3 clients 'only' needs 2M transforms. Performance stays steady over time (it doesn't get worse as the document grows like CRDTs will) - this scatterplot shows the number of seconds per 30 iterations on one of my tests: https://dl.dropboxusercontent.com/u/2494815/Screen%20Shot%202013-06-24%20at%201.01.42%20PM.png

I think this huge number of transforms is one of the costs of doing encryption & signing of the ops. We might be able to lower the number of transforms required if we trust our servers with our keys - but I'm not sure I want to do that. Real-world performance should be much better than that anyway, but yeah - it makes me nervous. I'm probably going to run more tests to see what kind of performance we can expect in practice.

Blah. That was a lot of words. Thoughts?

-J

[1] http://www.cypherpunks.ca/otr/
[2] http://hal.archives-ouvertes.fr/docs/00/34/59/11/PDF/main.pdf
[3] https://github.com/share/ottypes/blob/master/lib/text-tp2.js
[4] Based on my extremely fast non-tombstone equivalent data structure here: https://github.com/share/libot/blob/master/text.h