On Wed, Apr 20, 2016 at 9:20 PM, Adam Midvidy <[email protected]> wrote:

> What are the consistency guarantees for reads from followers? In
> particular, can a client read data from a secondary node written by a
> transaction that has not yet been committed by a quorum of the cluster?
>

To clarify, our server (called Bedrock) has three layers of significance to
this discussion -- a "Plugin" layer, a "Cluster" layer, and the "SQLite"
layer.

The Plugin layer consists of a series of commands -- written in C++ -- that
execute several queries on the local database inside a transaction.  In a
sense, each plugin consists of a series of stored procedures.  Some of
those stored procedures (which I'll call "transactions" for simplicity)
consist exclusively of reads and are executed on the slave, while those
that have at least one write query are escalated to the master and executed
there.
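
To make the routing concrete, here's a rough sketch of that decision --
invented names, not Bedrock's actual API:

    // Rough illustration only -- Command, isReadOnly(), and route() are
    // hypothetical names, not Bedrock's real plugin interface.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Command {
        std::string name;
        std::vector<std::string> queries;  // the "stored procedure" body
    };

    // Naive write detection, purely for illustration.
    bool isReadOnly(const Command& cmd) {
        for (const auto& q : cmd.queries) {
            if (q.find("INSERT") == 0 || q.find("UPDATE") == 0 ||
                q.find("DELETE") == 0) {
                return false;
            }
        }
        return true;
    }

    void route(const Command& cmd) {
        if (isReadOnly(cmd)) {
            std::cout << cmd.name << ": run locally on this node\n";
        } else {
            std::cout << cmd.name << ": escalate to the master\n";
        }
    }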

The Cluster layer consists of the Paxos-based distributed transaction
layer.  It picks a master, handles the escalation, etc.

Finally, the SQLite layer is just the stock SQLite library.  You'll recall
SQLite has no native networking capabilities; it only ever talks to the
Cluster and Plugin layers on the local node.

So, with that context: today there are multiple read threads and a single
write thread, on both the master and the slaves.  Each read and write
thread has its own database handle, and every thread currently processes at
most one transaction at a time.  The read threads begin transactions, but
always roll them back (because they don't actually execute any write
queries).  The write thread's behavior differs based on whether the node is
the master or a slave.  On the master, the write transaction consists of
executing the logic of the Plugin's stored procedure.  On a slave, the
write transaction consists of replaying the master's outstanding
transaction, which commits once approved via the two-phase commit protocol.

While a write transaction is outstanding, the read threads (on both the
master and the slaves) continue to process read transactions, but they
don't "see" the results of the write thread (whether on the master or a
slave) until the transaction has been approved.
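
In SQLite terms, each read thread looks roughly like this (a minimal
sketch using the plain SQLite C API, not Bedrock's actual code):

    // Minimal sketch of a read thread, using the stock SQLite C API.
    // Because each thread has its own handle, and SQLite never exposes
    // another connection's uncommitted changes, a read transaction can't
    // observe the write thread's not-yet-approved transaction.
    #include <sqlite3.h>

    void readThread(const char* dbPath) {
        sqlite3* db = nullptr;
        sqlite3_open(dbPath, &db);  // private handle for this thread
        sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
        // ... execute the command's read-only queries here ...
        // Nothing was written, so there's nothing to commit; ROLLBACK
        // just releases the read transaction.
        sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
        sqlite3_close(db);
    }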

So, long story short, clients cannot read data until it's been approved by
the master (which either waits for quorum, or just approves immediately if
non-critical).
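
That "approve immediately if non-critical" decision is the selective
synchronization described below; its shape is roughly this (hypothetical
names, with the 100-transaction cap being the one mentioned below):

    // Hypothetical sketch of the master's "selective synchronization"
    // decision (invented names): high-sensitivity commands block until a
    // quorum approves, and the master never races more than 100
    // transactions ahead of the slaves.
    bool shouldWaitForQuorum(bool isCriticalCommand,  // eg, moving money
                             int txnsAheadOfSlaves) {
        return isCriticalCommand || txnsAheadOfSlaves >= 100;
    }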


>>> 2) Is the APPROVE_COMMIT message sent after every transaction, or every
>>> batch of transactions?
>>>
>>
>> Actually... today we use a "selective synchronization" scheme, meaning
>> we only wait for the commit to be approved by a quorum of the cluster on
>> certain, high-sensitivity operations (eg, moving money).  This way the
>> master can "race ahead" without waiting on most transactions, and we allow
>> it to get up to 100 transactions ahead of the slaves before waiting for a
>> transaction to be approved.  And because today replication is serialized,
>> in practice approving any given transaction retroactively approves
>> committing all past transactions.  (This means the master can "fork" from
>> the cluster in some scenarios, but it gets kicked out automatically and we
>> can apply corrective actions if necessary.)
>>
>
> If I understand the above, it sounds like if the leader crashes, you can
> lose up to the 100 most recent transactions' worth of data. Is this
> correct?
>

In theory, yes.  In practice, we deploy two servers in each datacenter
(across three geographically-distributed datacenters, so six nodes total in
the cluster -- with one configured as a "permaslave" that doesn't
participate in quorum).  Given this, the node in the same datacenter as the
master generally gets most if not all of the master's transactions in a
real-world crash scenario.



>> However, for the solution we're discussing here, we'd need to remove
>> selective synchronization and do a separate APPROVE_COMMIT for every single
>> transaction.  That by itself would come with an enormous performance
>> *reduction*, because approving every single transaction when serialized
>> would reduce throughput to the inverse of the median latency of the cluster
>> (to get quorum).  However, I'm working on a way to run the approval logic
>> in parallel, to get the benefit of separate approvals for each
>> transaction, without the commensurate drop in replication performance.
>>
>
> Can you elaborate on how running the approval logic in parallel would
> increase throughput?
>

Well, right now each node has a single thread actually doing the writing.
Were it not for selective synchronization, the master's write thread would
spend most of its time idle, just waiting for a response from half of the
slaves.  This is because right now, the master only replicates one
transaction at a time -- it either approves that transaction, or rejects it
(which never actually happens), but either way it needs to wait for that
conclusion before even starting to process the next write command.
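
To put rough numbers on that: if the median round trip to get quorum
across datacenters were, say, 10ms, then fully-serialized approval would
cap the entire cluster at about 1/0.010 = 100 write transactions per
second, no matter how fast each local commit is.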

What I'm proposing is that instead of sitting idle and waiting for the
result of the "current" distributed transaction, it should instead set that
database handle aside (along with the uncommitted transaction it contains),
and then open a new database handle (to start a new write transaction) so
it can get started on the next command.  If, say, 50% of the write thread's
time is spent just waiting for a response, then this way the write thread
could process 2x as many write commands.
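
In pseudo-C++, the idea is something like this (a sketch with invented
names; it also glosses over the fact that stock SQLite normally allows
only one open write transaction per database at a time, which is the hard
part of actually making this work):

    // Sketch of the pipelined-write idea (hypothetical names, not real
    // Bedrock code).  NOTE: stock SQLite permits only one open write
    // transaction at a time, so this glosses over the hardest part.
    #include <sqlite3.h>
    #include <map>

    std::map<int, sqlite3*> inFlight;  // txn ID -> handle awaiting quorum

    void processWriteCommand(int txnID, const char* dbPath) {
        sqlite3* db = nullptr;
        sqlite3_open(dbPath, &db);
        sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
        // ... execute the command's write queries here ...
        // Don't COMMIT yet: park the handle and start the next command
        // while the slaves vote on this transaction.
        inFlight[txnID] = db;
    }

    void onQuorumApproved(int txnID) {  // called when approval arrives
        sqlite3* db = inFlight[txnID];
        sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
        sqlite3_close(db);
        inFlight.erase(txnID);
    }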

However, in practice the thread would be busy far less than 50% of the
time if it waited for "full approval" of every command.  So a single write
thread could actually process something like 50x more transactions if it
didn't need to wait for all the slaves to approve each one (50x implies the
thread is busy only ~2% of the time, with the rest spent waiting).
Accordingly, parallel transaction approval would give the same performance
advantages as selective synchronization, while obtaining full approval for
each transaction before committing on the master -- which would be
necessary to identify conflicting transactions.



>>> 4) In the case of a ROLLBACK, what do you roll back to? In my previous
>>> example, even if you ROLLBACK the update (2) on the follower node after
>>> committing the insert, this protocol is not guaranteed to make forward
>>> progress since you would have to rollback the insert as well (which you
>>> have already committed).
>>>
>>
>> The only transactions that would be rolled back are those that touch a
>> different set of pages than the master.  So in the case above, the INSERT
>> would always commit (because the pages it touches are the same regardless
>> of whether it's before or after the UPDATE), and the UPDATE would roll back
>> for *everyone* if *any* of the slaves attempted to commit it in a different
>> order than the master.  But if the master does the UPDATE before the INSERT
>> -- and all the slaves follow suit -- then it'd go through just fine.  (And
>> if the master did the UPDATE *after* the INSERT, so long as all the slaves
>> did the same, again, it'd work.)  The only time it gets rolled back is if a
>> slave commits in a different order than the master.
>>
>
> I see a potential problem here. If each follower applies a batch in a
> non-deterministic order, the chance that *any* follower applies them in
> some conflicting order increases with the number of followers.
>
> For example, if we have a set of transactions where 2 conflict, and we
> assume that secondaries choose an order at random in which to apply them,
> then the probability of a transaction getting applied in a different order
> on a given secondary is 50%. Therefore the probability of at least one
> secondary getting rolled back is (1 - 1/(2^n)).
>
> So for a cluster of say ~9 nodes (so 8 secondary nodes), the chance of a
> batch with 2 conflicting transactions failing to commit is 99.6%. This
> would seem to prevent the cluster from making forward progress.
>

I agree with that -- odds are, if two transactions are in fact conflicting,
at least one of the slaves will commit them in a different order than the
master did, causing the conflicting transaction to be rejected by the
master and retried later.  However, I'm not sure why this is necessarily a
problem.  The cluster will still make forward progress on all of the
non-conflicting transactions -- only the one that conflicted will be
rejected (and quickly retried).

Am I misunderstanding your concern?

-david
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
