Thanks for the detailed reply, David. Some additional questions and
discussion follow. A few of them concern more general aspects of your
replication protocol, so feel free to leave those unanswered if you would
rather discuss only the parallel batch-application mechanism.

On Wed, Apr 20, 2016 at 3:08 PM, David Barrett <[email protected]>
wrote:

> Great questions:
>
> On Wed, Apr 20, 2016 at 6:21 AM, Adam Midvidy <[email protected]> wrote:
>
>> 1) Can follower nodes service reads, or are all reads serviced by the
>> leader?
>>
>
> All nodes CAN service reads, but only the master/leader does writes.  In
> practice, we have a load balancing pool that contains all the current
> slaves (which can change, if the master changes), and they receive all
> direct requests -- both reads and writes -- and then escalate all writes to
> the master.  However, the master *can* respond to reads, we just don't
> burden it with read traffic in normal operation.
>
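
Just to check my mental model of the topology, I picture the routing as
roughly the following sketch (all names here are mine, not yours):

    import random

    # Hedged sketch of the read/write routing described above.
    class Node:
        def __init__(self, name, is_master=False):
            self.name = name
            self.is_master = is_master

        def read(self, query):
            # A slave answers reads from its own local copy -- possibly
            # including state the cluster has not yet approved?
            return f"{self.name}: {query}"

        def write(self, stmt):
            assert self.is_master, "only the master/leader commits writes"
            return f"{self.name} committed: {stmt}"

    def route(is_write, payload, master, slaves):
        if is_write:
            return master.write(payload)   # slaves escalate writes upward
        return random.choice(slaves).read(payload)  # load-balanced reads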

What are the consistency guarantees for reads from followers? In
particular, can a client read data from a secondary node that was written
by a transaction that has not yet been committed by a quorum of the
cluster?


>
>> 2) Is the APPROVE_COMMIT message sent after every transaction, or every
>> batch of transactions?
>>
>
> Actually... today we use "selective synchronization", meaning we only
> actually wait for the commit to be approved by a quorum of the cluster on
> certain high-sensitivity operations (e.g., moving money).  This way the
> master can "race ahead" without waiting on most transactions, and we allow
> it to get up to 100 transactions ahead of the slaves before waiting for a
> transaction to be approved.  And because replication is serialized today,
> in practice approving any given transaction retroactively approves
> committing all past transactions.  (This means the master can "fork" from
> the cluster in some scenarios, but it gets kicked out automatically and we
> can apply corrective actions if necessary.)
>
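
For my own understanding, here is a minimal sketch of the commit path as I
read it (the 100-transaction window is from your description; everything
else is my guess):

    from dataclasses import dataclass, field

    MAX_UNAPPROVED = 100  # the "race ahead" window you describe

    @dataclass
    class Master:
        unapproved: int = 0
        log: list = field(default_factory=list)

        def wait_for_quorum_approval(self):
            # Stand-in for collecting APPROVE_COMMIT from a quorum of
            # slaves. Since replication is serialized, approving the
            # latest transaction retroactively approves all earlier
            # unapproved ones.
            self.unapproved = 0

        def commit(self, txn, sensitive=False):
            self.log.append(txn)    # commit locally without waiting
            self.unapproved += 1
            if sensitive or self.unapproved >= MAX_UNAPPROVED:
                self.wait_for_quorum_approval()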

If I understand the above, it sounds like if the leader crashes, you can
lose up to the 100 most recent transactions' worth of data. Is this
correct?


>
> However, for the solution we're discussing here, we'd need to remove
> selective synchronization and do a separate APPROVE_COMMIT for every
> single transaction.  That by itself would come with an enormous
> performance *reduction*, because approving every single transaction when
> serialized would reduce throughput to the inverse of the median latency
> of the cluster (the time to get quorum).  However, I'm working on a way
> to run the approval logic in parallel, to get the benefit of separate
> approvals for each transaction without the commensurate drop in
> replication performance.
>
>
Can you elaborate on how running the approval logic in parallel would
increase throughput?
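
To make the question concrete, my naive model is the following (the
latency figure is invented):

    quorum_rtt = 0.005            # assume a 5 ms median quorum round trip

    serialized = 1 / quorum_rtt   # one APPROVE_COMMIT in flight at a time
    print(f"serialized: ~{serialized:.0f} commits/s")    # ~200 commits/s

    for in_flight in (10, 100):   # approvals pipelined in parallel
        print(f"{in_flight} in flight: "
              f"~{in_flight / quorum_rtt:.0f} commits/s")

Is roughly linear scaling with the number of in-flight approvals what you
expect?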



> 3) How does hashing work in the case of a delete? Do you send hashes of
>> the deleted pages as well?
>>
>
> Correct, I think a delete would be just like a write.
>
>
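
That matches my guess. For concreteness, I imagine the per-transaction
hash exchange looks something like this (the hash choice and page contents
are mine):

    import hashlib

    def page_hashes(touched_pages):
        # A delete dirties pages just like a write (rows removed, freelist
        # updated), so its pages get hashed and shipped the same way.
        return {page_no: hashlib.sha1(data).hexdigest()
                for page_no, data in touched_pages.items()}

    # e.g. pages a DELETE might touch: a leaf page and the freelist page
    print(page_hashes({7: b"leaf page after delete", 1: b"freelist"}))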

>
>> 4) In the case of a ROLLBACK, what do you roll back to? In my previous
>> example, even if you ROLLBACK the update (2) on the follower node after
>> committing the insert, this protocol is not guaranteed to make forward
>> progress, since you would have to roll back the insert as well (which
>> you have already committed).
>>
>
> The only transactions that would be rolled back are those that touch a
> different set of pages than the master.  So in the case above, the INSERT
> would always commit (because the pages it touches are the same regardless
> of whether it's before or after the UPDATE), and the UPDATE would roll back
> for *everyone* if *any* of the slaves attempted to commit it in a different
> order than the master.  But if the master does the UPDATE before the INSERT
> -- and all the slaves follow suit -- then it'd go through just fine.  (And
> if the master did the UPDATE *after* the INSERT, so long as all the slaves
> did the same, again, it'd work.)  The only time it gets rolled back is if a
> slave commits in a different order than the master.
>
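
To restate that rule in my own words, a sketch:

    def must_roll_back(master_pages, slave_pages):
        # A transaction is rolled back (for everyone) when a slave's
        # commit attempt touches a different set of pages than the
        # master's commit did.
        return set(master_pages) != set(slave_pages)

    # The INSERT touches the same pages either side of the UPDATE: safe.
    print(must_roll_back({3, 4}, {3, 4}))  # False -> commit stands
    # The UPDATE applied in a different order touches different pages.
    print(must_roll_back({3, 4}, {3, 5}))  # True  -> rolled back everywhere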

I see a potential problem here. If each follower applies a batch in a
non-deterministic order, the chance that *any* follower applies the
transactions in some conflicting order grows with the number of followers.

For example, suppose we have a batch of transactions in which 2 conflict,
and assume each secondary independently chooses a random order in which to
apply them. The probability that a given secondary applies the pair in a
different order than the master is 50%, so the probability that at least
one of n secondaries gets rolled back is 1 - 1/(2^n).

So for a cluster of, say, 9 nodes (8 secondary nodes), the chance that a
batch with 2 conflicting transactions fails to commit is 1 - 1/(2^8), or
about 99.6%. This would seem to prevent the cluster from making forward
progress.
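
A quick check of that arithmetic (assuming each conflicting pair is
ordered independently and uniformly at random on each secondary):

    # P(at least one of n secondaries picks the conflicting order)
    for n in (2, 4, 8, 16):
        print(f"n = {n:2d}: {1 - 0.5 ** n:.1%}")
    # n =  8: 99.6% -- the figure above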

Adam



> -david
>
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
