Hey there, it's been a while since we last discussed this thread, but I wanted to give a heads-up that we're going ahead with implementing this new multi-threaded replication protocol in Bedrock. You can read more about the final proposed design here: https://github.com/Expensify/Bedrock/issues/65 . Thanks for all your help, and I'd welcome any additional comments you can provide!
-david

PS: Since this conversation we also went ahead and open-sourced the whole thing; you can read more at http://bedrockdb.com/ -- I hope you'll give it a shot!

On Thu, Apr 21, 2016 at 7:55 AM, David Barrett <[email protected]> wrote:

> On Wed, Apr 20, 2016 at 9:20 PM, Adam Midvidy <[email protected]> wrote:
>
>> What are the consistency guarantees for reads from followers? In particular, can a client read data from a secondary node written by a transaction that has not yet been committed by a quorum of the cluster?
>
> To clarify, our server (called Bedrock) has three layers of significance to this discussion -- a "Plugin" layer, a "Cluster" layer, and an "SQLite" layer.
>
> The Plugin layer consists of a series of commands -- written in C++ -- that execute several queries on the local database inside a transaction. In a sense, each plugin consists of a series of stored procedures. Those stored procedures (which I'll call "transactions" for simplicity) that consist exclusively of reads are executed on the slave, while those that have at least one write query are escalated to the master and executed there.
>
> The Cluster layer is the Paxos-based distributed transaction layer. It picks a master, handles the escalation, etc.
>
> Finally, the SQLite layer is just the sqlite library. You'll recall sqlite has no native networking capabilities; it only talks to the Cluster and Plugin layers.
>
> So, with that context, today there are multiple read threads and a single write thread, on both the master and the slaves. Each read and write thread has its own database handle, and every thread currently processes at most one transaction at a time. The read threads begin transactions, but always roll them back (because they don't actually execute any write queries).
>
> The write thread's behavior differs based on whether it's running on the master or a slave. On the master, its transaction consists of the logic of the Plugin's stored procedure being executed. On a slave, however, that write transaction consists of the outstanding transaction being approved via the two-phase commit protocol.
>
> During this period, the read threads (on both the master and the slaves) continue to process read transactions, but they don't "see" the results of the write thread (whether on the master or the slave) until it's been approved.
>
> So, long story short, clients cannot read data until it's been approved by the master (which either waits for quorum, or just approves immediately if non-critical).
>
>>>> 2) Is the APPROVE_COMMIT message sent after every transaction, or after every batch of transactions?
>>>
>>> Actually... today we use "selective synchronization," meaning we only wait for the commit to be approved by a quorum of the cluster on certain high-sensitivity operations (e.g., moving money). This way the master can "race ahead" without waiting on most transactions, and we allow it to get up to 100 transactions ahead of the slaves before waiting for a transaction to be approved. And because today replication is serialized, in practice approving any given transaction retroactively approves committing all past transactions. (This means the master can "fork" from the cluster in some scenarios, but it gets kicked out automatically and we can apply corrective actions if necessary.)
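For reference, here's a minimal sketch of that selective-synchronization rule. It's illustrative only -- the names (Consistency, Command, Master, kHighSensitivity) are hypothetical, not Bedrock's actual API -- but it captures the two constraints described above: high-sensitivity commands block until a quorum has approved, and everything else races ahead by at most 100 unapproved commits.

#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_set>

enum class Consistency { ASYNC, QUORUM };

struct Command {
    std::string name;
};

// Hypothetical list of commands that must wait for quorum approval.
const std::unordered_set<std::string> kHighSensitivity = {"MoveMoney"};

Consistency consistencyFor(const Command& cmd) {
    return kHighSensitivity.count(cmd.name) ? Consistency::QUORUM
                                            : Consistency::ASYNC;
}

class Master {
  public:
    // Checked before the write thread picks up the next command: the master
    // may race ahead, but never more than 100 commits past what a quorum of
    // slaves has approved.
    bool canStartNextCommit() const {
        return lastCommitted_ - lastApproved_ < 100;
    }

    // Checked after committing a command locally: only high-sensitivity
    // commands block the master until that commit has quorum approval.
    bool mustBlockUntilApproved(const Command& cmd) const {
        return consistencyFor(cmd) == Consistency::QUORUM;
    }

    void onLocalCommit() { ++lastCommitted_; }

    // Replication is serialized, so quorum approval of commit N
    // retroactively approves every earlier commit as well.
    void onQuorumApproval(std::uint64_t commitNumber) {
        lastApproved_ = std::max(lastApproved_, commitNumber);
    }

  private:
    std::uint64_t lastCommitted_ = 0;  // commits applied locally on the master
    std::uint64_t lastApproved_ = 0;   // highest commit a quorum has approved
};

The key property is that, because replication is serialized, a single quorum approval advances lastApproved_ past every earlier commit at once.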
>> If I understand the above, it sounds like if the leader crashes, you can lose up to the 100 most recent transactions' worth of data. Is this correct?
>
> In theory, yes. In practice, we deploy two servers in each datacenter (across three geographically-distributed datacenters, so six nodes total in the cluster -- with one configured as a "permaslave" that doesn't participate in quorum). Given this, the node that's in the same datacenter generally gets most if not all of the master's transactions in a real-world crash scenario.
>
>>> However, for the solution we're discussing here, we'd need to remove selective synchronization and do a separate APPROVE_COMMIT for every single transaction. That by itself would come with an enormous performance *reduction*, because approving every single transaction when serialized would reduce throughput to the inverse of the median latency of the cluster (to get quorum). However, I'm working on a way to run the approval logic in parallel, to get the benefit of separate approvals for each transaction, without the commensurate drop in replication performance.
>>
>> Can you elaborate on how running the approval logic in parallel would increase throughput?
>
> Well, right now each node has a single thread actually doing the writing. Were it not for selective synchronization, the master's write thread would spend most of its time idle, just waiting for a response from half of the slaves. This is because right now, the master only replicates one transaction at a time -- it either approves that transaction, or rejects it (which never actually happens), but either way it needs to wait for that conclusion before even starting to process the next write command.
>
> What I'm proposing is that instead of sitting idle and waiting for the result of the "current" distributed transaction, it should just set that database handle aside (along with the uncommitted transaction it contains), and then open a new database handle (to start a new write transaction) so it can get started on the next command. If we say 50% of the write thread's time is just waiting for a response, this way the write thread could process 2x as many write commands.
>
> However, in practice the thread is far less than 50% active when waiting for "full approval" for every command. So a single write thread can actually process something like 50x more transactions if it doesn't need to wait for all slaves to approve each one. Accordingly, parallel transaction approval would give the same performance advantages as selective synchronization, while obtaining full approval for each transaction before committing on the master -- which would be necessary to identify conflicting transactions.
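To make the "set the handle aside" idea concrete, here's a rough sketch of the bookkeeping involved. Again, this is illustrative only (DBHandle, WriteThread, and the method names are hypothetical, not actual Bedrock code): each outstanding commit parks its own handle with the transaction still open, the write thread immediately moves on to the next command on a fresh handle, and the parked transaction is committed or rolled back whenever its approval result arrives.

#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Stand-in for an sqlite3 connection with an open, uncommitted transaction.
struct DBHandle {
    void commit()   { /* COMMIT the prepared transaction */ }
    void rollback() { /* ROLLBACK the prepared transaction */ }
};

class WriteThread {
  public:
    // Called after a command's queries have run inside a transaction on
    // `handle` and the commit has been broadcast to the slaves for approval.
    // The write thread then opens a new handle and starts the next command
    // instead of sitting idle waiting for APPROVE_COMMIT.
    void parkForApproval(std::uint64_t commitNumber, std::unique_ptr<DBHandle> handle) {
        pending_[commitNumber] = std::move(handle);
    }

    // Called when a quorum of slaves has approved the commit, or when any
    // slave reports committing a conflicting transaction in a different order.
    void onApprovalResult(std::uint64_t commitNumber, bool approved) {
        auto it = pending_.find(commitNumber);
        if (it == pending_.end()) {
            return;
        }
        if (approved) {
            it->second->commit();
        } else {
            // Conflict: roll back and let the command be retried later.
            it->second->rollback();
        }
        pending_.erase(it);
    }

  private:
    // One parked, uncommitted transaction per outstanding commit number.
    std::map<std::uint64_t, std::unique_ptr<DBHandle>> pending_;
};

The win is that the thread's waiting time overlaps across commands rather than serializing behind each approval, which is where the 2x-to-50x estimate above comes from.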
>>>> 4) In the case of a ROLLBACK, what do you roll back to? In my previous example, even if you ROLLBACK the update (2) on the follower node after committing the insert, this protocol is not guaranteed to make forward progress, since you would have to roll back the insert as well (which you have already committed).
>>>
>>> The only transactions that would be rolled back are those that touch a different set of pages than they did on the master. So in the case above, the INSERT would always commit (because the pages it touches are the same regardless of whether it comes before or after the UPDATE), and the UPDATE would roll back for *everyone* if *any* of the slaves attempted to commit it in a different order than the master. But if the master does the UPDATE before the INSERT -- and all the slaves follow suit -- then it'd go through just fine. (And if the master did the UPDATE *after* the INSERT, so long as all the slaves did the same, again, it'd work.) The only time it gets rolled back is if a slave commits in a different order than the master.
>>
>> I see a potential problem here. If each follower applies a batch in a non-deterministic order, the chance that *any* follower applies them in some conflicting order increases with the number of followers.
>>
>> For example, if we have a set of transactions where 2 conflict, and we assume that secondaries choose at random the order in which to apply them, then the probability of a transaction getting applied in a different order on a given secondary is 50%. Therefore the probability of at least one secondary getting rolled back is (1 - 1/(2^n)).
>>
>> So for a cluster of, say, ~9 nodes (so 8 secondary nodes), the chance of a batch with 2 conflicting transactions failing to commit is 99.6%. This would seem to prevent the cluster from making forward progress.
>
> I agree with that -- odds are, if two transactions are in fact conflicting, at least one of the slaves will commit them in a different order than the master did, causing the transaction to be rejected by the master and retried later. However, I'm not sure why this is necessarily a problem. The cluster will still make forward progress on all of the non-conflicting transactions -- only the one that conflicted will be rejected (and quickly retried).
>
> Am I misunderstanding your concern?
>
> -david
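As a quick sanity check on the probability argument quoted above (this snippet isn't from the thread, just a worked illustration of the arithmetic): if each of n secondaries independently applies a pair of conflicting transactions in a random order, the chance that at least one of them diverges from the master's order is 1 - (1/2)^n.

#include <cmath>
#include <cstdio>

int main() {
    // Probability that at least one of n secondaries applies two
    // conflicting transactions in a different order than the master,
    // assuming each secondary orders them independently at random.
    for (int n : {1, 2, 4, 8}) {
        double pAtLeastOneDiverges = 1.0 - std::pow(0.5, n);
        std::printf("%d secondaries: %.1f%% chance of a conflict-driven rollback\n",
                    n, 100.0 * pAtLeastOneDiverges);
    }
    return 0;
}

For n = 8 this prints 99.6%, matching the figure quoted above; the reply's point is that only the conflicting pair is rejected and retried, while the rest of the batch still commits.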
