Hey there, it's been a while since we last discussed this thread, but I
wanted to give a heads-up that we're going ahead with implementing this
new multi-threaded replication protocol in Bedrock.  You can read more
about the final proposed design at
https://github.com/Expensify/Bedrock/issues/65 -- thanks for all your
help, and I'd welcome any additional comments you can provide!

-david

PS: Since this conversation we also went ahead and open sourced the whole
thing; you can read more at http://bedrockdb.com/ -- I hope you'll give it
a shot!

On Thu, Apr 21, 2016 at 7:55 AM, David Barrett <[email protected]>
wrote:

> On Wed, Apr 20, 2016 at 9:20 PM, Adam Midvidy <[email protected]> wrote:
>
>> What are the consistency guarantees for reads from followers? In
>> particular can a client read data from a secondary node written by
>> transaction that has not yet been committed by a quorum of the cluster?
>>
>
> To clarify, our server (called Bedrock) has three layers of significance
> to this discussion -- a "Plugin" layer, a "Cluster" layer, and an
> "SQLite" layer.
>
> The Plugin layer consists of a series of commands -- written in C++ --
> that execute several queries on the local database inside a transaction.
> In a sense, each plugin consists of a series of stored procedures.  Some
> of those stored procedures (which I'll call "transactions" for
> simplicity) consist exclusively of reads and are executed on the slave,
> while those that have at least one write query are escalated to the
> master and executed there.
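>
> Roughly, such a read-only command might look like this (a minimal
> sketch; the function and table names are illustrative, not our actual
> plugin API):
>
>     #include <sqlite3.h>
>     #include <string>
>
>     // Hypothetical read-only "stored procedure": runs a single SELECT
>     // against the local SQLite database inside the plugin's transaction.
>     bool getUserEmail(sqlite3* db, const std::string& userID,
>                       std::string& emailOut) {
>         sqlite3_stmt* stmt = nullptr;
>         const char* sql = "SELECT email FROM users WHERE userID = ?;";
>         if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
>             return false;
>         }
>         sqlite3_bind_text(stmt, 1, userID.c_str(), -1, SQLITE_TRANSIENT);
>         bool found = false;
>         if (sqlite3_step(stmt) == SQLITE_ROW) {
>             emailOut = reinterpret_cast<const char*>(
>                 sqlite3_column_text(stmt, 0));
>             found = true;
>         }
>         sqlite3_finalize(stmt);
>         return found;
>     }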
>
> The Cluster layer consists of the Paxos-based distributed transaction
> layer.  It picks a master, handles the escalation, etc.
>
> Finally, the SQLite layer is just the sqlite library.  You'll recall
> sqlite has no native networking capabilities; instead, the Cluster and
> Plugin layers drive it directly.
>
> So, with that context, today there are multiple read threads and a single
> write thread, on both the master and slaves.  Each read and write thread
> has its own database handle, and every thread currently processes at most
> one transaction at a time.  The read threads begin transactions, but always
> roll them back (because they don't actually execute any write queries).
> The write thread's behavior differs based on whether it's the master or a
> slave.  On the master, that transaction consists of the logic of the
> Plugin's stored procedure being executed.  On the slave, however, that
> write transaction consists of the outstanding transaction being approved
> via the two-phase commit protocol.
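>
> In sketch form (illustrative, not our exact code), a read thread's
> per-command loop looks something like:
>
>     #include <sqlite3.h>
>
>     // Each read thread owns its own handle.  Every read command runs
>     // inside a transaction that is always rolled back: the thread gets
>     // a stable snapshot but never modifies the database.
>     void processReadCommand(sqlite3* db) {
>         sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
>         // ... execute the command's read-only queries here ...
>         sqlite3_exec(db, "ROLLBACK;", nullptr, nullptr, nullptr);
>     }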
>
> During this period, the read threads (on both master and slaves) continue
> to process read transactions, but the read threads don't "see" the results
> of the write thread (whether on the master or the slave) until it's been
> approved.
>
> So, long story short, clients cannot read data until it's been approved by
> the master (which either waits for quorum, or just approves immediately if
> non-critical).
>
>
> 2) Is the APPROVE_COMMIT message sent after every transaction, or every
>>>> batch of transactions?
>>>>
>>>
>>> Actually... today we use "selective synchronization", meaning we only
>>> actually wait for the commit to be approved by a quorum of the cluster
>>> on certain, high-sensitivity operations (eg, moving money).  This way
>>> the master can "race ahead" without waiting on most transactions, and
>>> we allow it to get up to 100 transactions ahead of the slaves before
>>> waiting for a transaction to be approved.  And because today
>>> replication is serialized, in practice approving any given transaction
>>> retroactively approves committing all past transactions.  (This means
>>> the master can "fork" from the cluster in some scenarios, but it gets
>>> kicked out automatically and we can apply corrective actions if
>>> necessary.)
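>>>
>>> In rough pseudocode (the names here are made up for illustration --
>>> the real check lives in the cluster layer):
>>>
>>>     struct Command {
>>>         bool isHighSensitivity;  // eg, moving money
>>>     };
>>>
>>>     // Hypothetical: does this commit need to wait for quorum?
>>>     bool requiresQuorum(const Command& cmd, int commitsAheadOfSlaves) {
>>>         // High-sensitivity operations always wait for quorum.
>>>         if (cmd.isHighSensitivity) {
>>>             return true;
>>>         }
>>>         // Otherwise the master may race ahead, but only up to 100
>>>         // unapproved transactions before it must stop and wait.
>>>         return commitsAheadOfSlaves >= 100;
>>>     }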
>>>
>>
>> If I understand the above, it sounds like if the leader crashes, you
>> can lose up to the 100 most recent transactions' worth of data. Is this
>> correct?
>>
>
> In theory, yes.  In practice, we deploy two servers in each datacenter
> (across three geographically-distributed datacenters, so six nodes total
> in the cluster -- with one configured as a "permaslave" that doesn't
> participate in quorum).  Given this, the node in the same datacenter
> generally gets most if not all of the master's transactions in a
> real-world crash scenario.
>
>
>
>> However, for the solution we're discussing here, we'd need to remove
>>> selective synchronization and do a separate APPROVE_COMMIT for every single
>>> transaction.  That by itself would come with an enormous performance
>>> *reduction*, because approving every single transaction when serialized
>>> would reduce throughput to the inverse of the median latency of the cluster
>>> (to get quorum).  However, I'm working on a way to run the approval logic
>>> in parallel, to get the benefit of separate approvals for each
>>> transaction, without the commensurate drop in replication performance.
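>>>
>>> (To put a number on that: if, say, the median round trip to reach
>>> quorum were 20ms -- purely an illustrative figure -- a fully
>>> serialized approve-per-commit scheme would top out around
>>> 1 / 0.020s = 50 commits per second, no matter how fast the disk is.)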
>>>
>>
>> Can you elaborate on how running the approval logic in parallel would
>> increase throughput?
>>
>
> Well, right now each node has a single thread actually doing the writing.
> Were it not for selective synchronization, the master's write thread would
> spend most of its time idle, just waiting for a response from half of the
> slaves.  This is because right now, the master only replicates one
> transaction at a time -- it either approves that transaction, or rejects it
> (which never actually happens), but either way it needs to wait for that
> conclusion before even starting to process the next write command.
>
> What I'm proposing is that instead of sitting idle and waiting for the
> result of the "current" distributed transaction, the write thread should
> set that database handle aside (along with the uncommitted transaction
> it contains) and open a new database handle (to start a new write
> transaction) so it can get started on the next command.  If we say 50%
> of the write thread's time is spent just waiting for a response, this
> way the write thread could process 2x as many write commands.
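>
> As a sketch (the structure here is hypothetical, just to show the shape
> of the idea):
>
>     #include <sqlite3.h>
>     #include <cstdint>
>     #include <queue>
>
>     struct PendingCommit {
>         sqlite3* db;        // handle holding the open, uncommitted txn
>         uint64_t commitID;  // which transaction we're waiting on
>     };
>     std::queue<PendingCommit> awaitingApproval;
>
>     // Park the current handle (its transaction stays open until the
>     // cluster approves or rejects it) and hand back a fresh handle so
>     // the write thread can begin the next transaction immediately.
>     sqlite3* startNextWrite(sqlite3* current, uint64_t commitID,
>                             const char* dbFile) {
>         awaitingApproval.push({current, commitID});
>         sqlite3* next = nullptr;
>         sqlite3_open(dbFile, &next);
>         return next;
>     }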
>
> However, in practice the thread is active far less than 50% of the time
> when waiting for "full approval" for every command.  So a single write
> thread could actually process something like 50x more transactions if it
> didn't need to wait for all slaves to approve each one.  Accordingly,
> parallel transaction approval would give the same performance advantages
> as selective synchronization, while still obtaining full approval for
> each transaction before committing on the master -- which would be
> necessary to identify conflicting transactions.
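>
> (Concretely: if the write thread were busy only 2% of the time and
> waiting the other 98% -- again, an illustrative figure -- then
> overlapping those waits lets one thread do roughly 1 / 0.02 = 50x the
> work, which is where the ~50x above comes from.)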
>
>
>
>> 4) In the case of a ROLLBACK, what do you rollback to? In my previous
>>>> example, even if you ROLLBACK the update (2) on the follower node after
>>>> committing the insert, this protocol is not guaranteed to make forward
>>>> progress since you would have to rollback the insert as well (which you
>>>> have already committed).
>>>>
>>>
>>> The only transactions that would be rolled back are those that touch a
>>> different set of pages than the master.  So in the case above, the
>>> INSERT would always commit (because the pages it touches are the same
>>> regardless of whether it's before or after the UPDATE), and the UPDATE
>>> would roll back for *everyone* if *any* of the slaves attempted to
>>> commit it in a different order than the master.  But if the master does
>>> the UPDATE before the INSERT -- and all the slaves follow suit -- then
>>> it'd go through just fine.  (And if the master did the UPDATE *after*
>>> the INSERT, so long as all the slaves did the same, again, it'd work.)
>>> The only time it gets rolled back is if a slave commits in a different
>>> order than the master.
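>>>
>>> The conflict test itself is simple in sketch form (a hypothetical
>>> helper, not our actual code): two transactions conflict exactly when
>>> the sets of pages they touched overlap.
>>>
>>>     #include <algorithm>
>>>     #include <cstdint>
>>>     #include <iterator>
>>>     #include <set>
>>>
>>>     // Two transactions conflict iff their touched-page sets intersect.
>>>     bool conflicts(const std::set<uint32_t>& pagesA,
>>>                    const std::set<uint32_t>& pagesB) {
>>>         std::set<uint32_t> shared;
>>>         std::set_intersection(pagesA.begin(), pagesA.end(),
>>>                               pagesB.begin(), pagesB.end(),
>>>                               std::inserter(shared, shared.begin()));
>>>         return !shared.empty();
>>>     }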
>>>
>>
>> I see a potential problem here. If each follower applies a batch in a
>> non-deterministic order, the chance that *any* follower applies them in
>> some conflicting order increases with the number of followers.
>>
>> For example, if we have a set of transactions where 2 conflict, and we
>> assume that secondaries choose an order at random in which to apply
>> them, then the probability of a transaction getting applied in a
>> different order on a given secondary is 50%. Therefore the probability
>> of at least one secondary getting rolled back is (1 - 1/(2^n)).
>>
>> So for a cluster of say ~9 nodes (so 8 secondary nodes), the chance of a
>> batch with 2 conflicting transactions failing to commit is 99.6%. This
>> would seem to prevent the cluster from making forward progress.
>>
>
> I agree with that -- if two transactions do in fact conflict, odds are
> at least one of the slaves will commit them in a different order than
> the master, causing the transaction to be rejected by the master and
> retried later.  However, I'm not sure why this is necessarily a problem.
> The cluster will still make forward progress on all of the
> non-conflicting transactions -- only the one that conflicted will be
> rejected (and quickly retried).
>
> Am I misunderstanding your concern?
>
> -david
>
>
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers