On Wed, Apr 20, 2016 at 9:20 PM, Adam Midvidy <[email protected]> wrote:
> What are the consistency guarantees for reads from followers? In
> particular, can a client read data from a secondary node written by a
> transaction that has not yet been committed by a quorum of the
> cluster?

To clarify, our server (called Bedrock) has three layers of
significance to this discussion -- a "Plugin" layer, a "Cluster" layer,
and the "SQLite" layer. The Plugin layer consists of a series of
commands -- written in C++ -- each of which executes several queries on
the local database inside a transaction. In a sense, each plugin
consists of a series of stored procedures. Those stored procedures
(which I'll call "transactions" for simplicity) that consist
exclusively of reads are executed on the slave, while those that have
at least one write query are escalated to the master and executed
there. The Cluster layer consists of the Paxos-based distributed
transaction layer; it picks a master, handles the escalation, etc.
Finally, the SQLite layer is just the sqlite library. You'll recall
sqlite has no native networking capabilities; instead it talks to the
Cluster and Plugin layers.

So, with that context: today there are multiple read threads and a
single write thread, on both the master and the slaves. Each read and
write thread has its own database handle, and every thread currently
processes at most one transaction at a time. The read threads begin
transactions, but always roll them back (because they don't actually
execute any write queries). The write thread's behavior differs based
on whether it's on the master or a slave. On the master, the write
transaction consists of the Plugin's stored procedure logic being
executed. On a slave, however, the write transaction consists of the
outstanding transaction being approved via the two-phase commit
protocol. During this period, the read threads (on both the master and
the slaves) continue to process read transactions, but they don't "see"
the results of the write thread until it's been approved. So, long
story short, clients cannot read data until it's been approved by the
master (which either waits for quorum, or just approves immediately if
non-critical).
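Going back to the read threads for a second, here's roughly what that
begin/read/rollback pattern looks like against the raw sqlite3 C API.
This is just a minimal sketch, not our actual Plugin code -- the file
name, table, and query are all made up for illustration:

    #include <sqlite3.h>

    void readThread() {
        // Each read thread owns its own private connection.
        sqlite3* db = nullptr;
        sqlite3_open("bedrock.db", &db);

        // Wrap the whole batch of reads in one transaction so they see
        // a consistent view; the write thread's uncommitted (i.e., not
        // yet approved) work is never visible here.
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);

        // ... run the read-only queries of the stored procedure ...
        sqlite3_exec(db, "SELECT balance FROM accounts WHERE id = 42;",
                     nullptr, nullptr, nullptr);

        // Nothing was written, so just roll back to end the
        // transaction (error handling omitted throughout).
        sqlite3_exec(db, "ROLLBACK;", nullptr, nullptr, nullptr);

        sqlite3_close(db);
    }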
>>> 2) Is the APPROVE_COMMIT message sent after every transaction, or
>>> every batch of transactions?
>>
>> Actually... today we use "selective synchronization", meaning we
>> only actually wait for the commit to be approved by a quorum of the
>> cluster on certain, high-sensitivity operations (eg, moving money).
>> This way the master can "race ahead" without waiting on most
>> transactions, and we allow it to get up to 100 transactions ahead of
>> the slaves before waiting for a transaction to be approved. And
>> because today replication is serialized, in practice approving any
>> given transaction retroactively approves committing all past
>> transactions. (This means the master can "fork" from the cluster in
>> some scenarios, but it gets kicked out automatically and we can
>> apply corrective actions if necessary.)
>
> If I understand the above, it sounds like if the leader crashes, you
> can lose up to the 100 most recent transactions' worth of data. Is
> this correct?

In theory, yes. In practice, we deploy two servers in each datacenter
(across three geographically-distributed datacenters, so six nodes
total in the cluster -- with one configured as a "permaslave" that
doesn't participate in quorum). Given this, the slave in the same
datacenter as the master generally gets most if not all of the master's
transactions in a real-world crash scenario.

>> However, for the solution we're discussing here, we'd need to remove
>> selective synchronization and do a separate APPROVE_COMMIT for every
>> single transaction. That by itself would come with an enormous
>> performance *reduction*, because approving every single transaction
>> when serialized would reduce throughput to the inverse of the median
>> latency of the cluster (to get quorum). However, I'm working on a
>> way to run the approval logic in parallel, to get the benefit of
>> separate approvals for each transaction, without the commensurate
>> drop in replication performance.
>
> Can you elaborate on how running the approval logic in parallel would
> increase throughput?

Well, right now each node has a single thread actually doing the
writing. Were it not for selective synchronization, the master's write
thread would spend most of its time idle, just waiting for a response
from half of the slaves. That's because right now the master only
replicates one transaction at a time: it either approves that
transaction or rejects it (which never actually happens), but either
way it needs to wait for that conclusion before even starting to
process the next write command.

What I'm proposing is that instead of sitting idle waiting for the
result of the "current" distributed transaction, the write thread
should set that database handle aside (along with the uncommitted
transaction it contains) and open a new database handle (to start a new
write transaction) so it can get started on the next command. If we say
50% of the write thread's time is spent just waiting for a response,
this way the write thread could process 2x as many write commands. In
practice, though, the thread is active far less than 50% of the time
when waiting for "full approval" on every command, so a single write
thread could actually process something like 50x more transactions if
it doesn't need to wait for all slaves to approve each one.
Accordingly, parallel transaction approval would give the same
performance advantage as selective synchronization, but while obtaining
full approval for each transaction before committing on the master --
which would be necessary to identify conflicting transactions.
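To make the handle-juggling concrete, here's a rough sketch of what I
have in mind on the master. This is not real Bedrock code -- the names
(pendingApproval, processWriteCommand, etc.) are hypothetical, and note
that stock sqlite only allows one uncommitted write transaction per
database at a time, so this presumes the kind of concurrent-write
support we're discussing in this thread. Assume everything below runs
serialized on the write thread:

    #include <sqlite3.h>
    #include <cstdint>
    #include <map>
    #include <string>

    // Write transactions parked while awaiting quorum approval,
    // keyed by transaction id.
    std::map<uint64_t, sqlite3*> pendingApproval;

    void processWriteCommand(uint64_t txnId, const std::string& sql) {
        // Open a fresh handle so any previous (still-uncommitted)
        // transactions can keep waiting on their own handles.
        sqlite3* db = nullptr;
        sqlite3_open("bedrock.db", &db);

        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        sqlite3_exec(db, sql.c_str(), nullptr, nullptr, nullptr);

        // Don't COMMIT yet -- park the handle and move on to the next
        // command while the cluster votes on this one.
        pendingApproval[txnId] = db;
        // broadcastForApproval(txnId);  // hypothetical cluster call
    }

    void onApproveCommit(uint64_t txnId) {
        // Quorum reached: commit the parked transaction and free the
        // handle.
        sqlite3* db = pendingApproval[txnId];
        sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
        sqlite3_close(db);
        pendingApproval.erase(txnId);
    }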
>>> 4) In the case of a ROLLBACK, what do you roll back to? In my
>>> previous example, even if you ROLLBACK the update (2) on the
>>> follower node after committing the insert, this protocol is not
>>> guaranteed to make forward progress, since you would have to roll
>>> back the insert as well (which you have already committed).
>>
>> The only transactions that would be rolled back are those that touch
>> a different set of pages than the master. So in the case above, the
>> INSERT would always commit (because the pages it touches are the
>> same regardless of whether it's before or after the UPDATE), and the
>> UPDATE would roll back for *everyone* if *any* of the slaves
>> attempted to commit it in a different order than the master. But if
>> the master does the UPDATE before the INSERT -- and all the slaves
>> follow suit -- then it'd go through just fine. (And if the master
>> did the UPDATE *after* the INSERT, so long as all the slaves did the
>> same, again, it'd work.) The only time it gets rolled back is if a
>> slave commits in a different order than the master.
>
> I see a potential problem here. If each follower applies a batch in a
> non-deterministic order, the chance that *any* follower applies them
> in some conflicting order increases with the number of followers.
>
> For example, if we have a set of transactions where 2 conflict, and
> we assume that secondaries choose an order at random in which to
> apply them, then the probability of a transaction getting applied in
> a different order on a given secondary is 50%. Therefore the
> probability of at least one secondary getting rolled back is
> (1 - 1/(2^n)).
>
> So for a cluster of say ~9 nodes (so 8 secondary nodes), the chance
> of a batch with 2 conflicting transactions failing to commit is
> 99.6%. This would seem to prevent the cluster from making forward
> progress.
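Quick check of that arithmetic, under your stated assumption that each
of the n = 8 secondaries independently picks one of the two orders with
probability 1/2:

    P(at least one secondary differs from the master)
        = 1 - (1/2)^8
        = 1 - 1/256
        = 255/256
        ≈ 99.6%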
I agree with that -- odds are, if two transactions do in fact conflict,
at least one of the slaves will likely commit them in a different order
than the master, causing the transaction to be rejected by the master
and retried later. However, I'm not sure why this is necessarily a
problem: the cluster will still make forward progress on all of the
non-conflicting transactions -- only the one that conflicted will be
rejected (and quickly retried). Am I misunderstanding your concern?

-david

_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
