On Mon, Apr 18, 2016 at 1:59 AM, David Barrett <[email protected]>
wrote:

> Thanks for this response!  However, for this line:
>
> > Oracle offers multimaster replication, but from their documentation it
> appears to provide AP guarantees when it sounds like your requirements
> dictate a CP system.
>
> Can you clarify what AP versus CP means?
>

The CAP theorem <https://en.wikipedia.org/wiki/CAP_theorem> shows that a
distributed database cannot provide both availability and consistency in
the presence of network partitions, so a design must choose which to
sacrifice. CP systems choose consistency: when properly implemented they
maintain linearizable <https://en.wikipedia.org/wiki/Linearizability>
read/write semantics from the POV of an external client, at the cost of
rejecting client operations during certain network partitions. Conversely,
any node in an AP system will always accept a write or service a read, but
clients can read conflicting values from different nodes.
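
To make the distinction concrete, here is a toy sketch in Python (my own
illustration, not modeled on any real system) of how the two designs react
to the same partition: the CP node refuses a write it cannot get a
majority to acknowledge, while the AP node accepts it unconditionally and
leaves any conflicts to be reconciled later.

    MAJORITY = 2  # majority of a three-node cluster

    class Replica:
        def __init__(self):
            self.value = None

        def ack(self, value):
            self.value = value

    class CPNode(Replica):
        def write(self, value, reachable):
            # Refuse unless this node plus the replicas it can still
            # reach form a majority: consistency over availability.
            if 1 + len(reachable) < MAJORITY:
                raise RuntimeError("partitioned from majority; rejected")
            self.ack(value)
            for r in reachable:
                r.ack(value)

    class APNode(Replica):
        def write(self, value, reachable):
            # Accept unconditionally: availability over consistency.
            # Two partitioned nodes can now hold conflicting values.
            self.ack(value)
            for r in reachable:
                r.ack(value)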

From your prior description, it seems that your system is designed to
provide CP semantics.


> As for sharding, unfortunately that doesn't really work for us.  That
> would dramatically simplify things, but Expensify is more like a social
> network than a typical enterprise application: anybody can submit expenses
> to anybody else.  This means we have a single consolidated datastore for
> all users.  That said, I'm not looking to build a full multi-master
> environment (yet).  I'm just looking to do multi-threaded writes
> orchestrated by a single master.
>

Having used Expensify at a prior job (great work by the way!), it seemed
to me like the data was split by organization, i.e. users from Company A
do not file expense reports to be approved by someone from Company B. That
is why I suggested partitioning the data by organization. However, I have
only a naive view of your application's architecture, so I may well be
wrong about this.


>
> How does Oracle or MySQL do this when distributed transactions are used?
> They clearly allow multi-threaded writes on the master (ie, it's fine to
> have two write transactions so long as they don't conflict).  So I'm
> *guessing* they do basically the same thing I'm considering: allow slaves
> to commit non-conflicting transactions in any order, perhaps by using a
> "batchID" that increments every time someone commits a new transaction that
> overwrites the data read/written by a previous transaction.
>

MySQL secondaries apply replication updates to different databases in
parallel
<http://d2-systems.blogspot.com/2011/04/mysql-56x-feature-preview-multi.html>.
This is essentially a form of sharding, since updates within a single
database are still applied serially.
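
As a rough mental model (the code and names below are mine, not MySQL's),
the applier can be pictured as one worker thread per database: events for
the same database stay in log order while different databases proceed in
parallel.

    import queue
    import threading

    class PerDatabaseApplier:
        """Toy model of a replica applying updates in parallel,
        partitioned by database name."""

        def __init__(self):
            self.queues = {}  # database name -> queue of its events

        def _worker(self, q):
            while True:
                event = q.get()  # events for one database, in log order
                event.apply()    # assumed: an event can apply itself
                q.task_done()

        def dispatch(self, event):
            # Serial within a database, parallel across databases.
            q = self.queues.get(event.database)
            if q is None:
                q = queue.Queue()
                self.queues[event.database] = q
                threading.Thread(target=self._worker, args=(q,),
                                 daemon=True).start()
            q.put(event)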

Information on Oracle is a bit harder to come by because they are closed
source and have multiple differing replication offerings. Oracle RAC
<https://docs.oracle.com/cd/B28359_01/rac.111/b28254/admcon.htm> relies on
each replica being backed by a shared SAN array, which pushes
synchronization concerns down to the storage layer. Oracle also offers a
form of CP multi-master replication in which "the database administrator
is responsible for resolving the data conflict manually"
<https://docs.oracle.com/cd/B28359_01/server.111/b28326/repmaster.htm#i33607>.


> But I'm curious if anybody knows this for real, as I'm just guessing.
> Thanks!
>
> -david
>

I can't think of any CP systems used in industry that use your idea of
batching non-conflicting replication updates and allowing replicas to
apply them in parallel. Some AP systems do use similar techniques, e.g.
optimistic replication
<https://pages.lip6.fr/Marc.Shapiro/papers/Optimistic_Replication_Computing_Surveys_2005-03_cameraready.pdf>.

The reason this approach isn't popular is twofold: efficiently detecting
non-conflicting transactions is far from trivial, and when the workload
really is composed of non-conflicting transactions, partitioning
(sharding) provides the same performance benefits with less complexity.
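
To illustrate why, here is a strawman of the bookkeeping such a scheme
needs (my own sketch, not taken from any shipping system): deciding
whether two transactions commute requires accurate read/write sets, which
are hard to extract from arbitrary SQL in the first place, and then a
pairwise check per batch.

    def conflicts(t1, t2):
        """t1 and t2 expose .reads and .writes as sets of page (or
        row) identifiers -- extracting these sets accurately is the
        hard part in practice."""
        return bool(t1.writes & (t2.reads | t2.writes)) or \
               bool(t2.writes & t1.reads)

    def batch(transactions):
        """Greedily group log-ordered transactions into batches whose
        members replicas may commit in any order. The batches
        themselves are still applied serially, preserving conflict
        order."""
        batches = []
        for t in transactions:
            # t must land after the last batch it conflicts with.
            target = 0
            for i, b in enumerate(batches):
                if any(conflicts(t, other) for other in b):
                    target = i + 1
            if target == len(batches):
                batches.append([])
            batches[target].append(t)
        return batches

Note that the pairwise scan is O(n^2) comparisons in the worst case,
which is part of why this bookkeeping is not free.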

There has been work in this area in academia. In this paper
<http://arxiv.org/pdf/1311.6183.pdf>, the authors present P-SMR, a
technique to "deliver commands across replicas in total order but allow
them to execute concurrently when possible". If you read their evaluation
section, however, they get (by far) the best scalability improvements when
the workload is composed of completely independent transactions, which,
again, is no better than sharding.
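
The core trick, heavily simplified (the names below are mine), is that
the mapping from a command to worker threads is computed deterministically
from the command's own parameters, so every replica parallelizes
identically without extra coordination; commands that span several workers
force those workers to rendezvous, which is why fully independent
workloads scale best.

    import zlib

    N_WORKERS = 4

    def workers_for(command):
        # Python's built-in hash() is salted per process, so use a
        # stable hash: every replica must compute the same mapping.
        return {zlib.crc32(key.encode()) % N_WORKERS
                for key in command.keys}

    def schedule(commands):
        """commands arrive in the same total order at every replica."""
        plan = {w: [] for w in range(N_WORKERS)}
        for c in commands:
            ws = workers_for(c)
            barrier = len(ws) > 1  # multi-worker commands synchronize
            for w in ws:
                plan[w].append((c, barrier))
        return plan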

Another system from academia worth looking at is Eve
<http://web.cse.ohio-state.edu/~yangwang/osdi12-final190.pdf>, which
improves on the throughput of P-SMR by optimistically executing
transactions in parallel, comparing hashes of the result, and rolling back
to a consistent state if necessary. While Eve achieves impressively high
throughput, it does so at the cost of higher system complexity and of
unpredictable latency caused by probabilistic rollbacks.
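
In outline, the execute-verify loop looks something like this (heavily
simplified; execute_parallel and verify below are stand-ins for Eve's
parallel executor and its agreement stage, respectively):

    import hashlib

    def state_digest(state):
        # Canonical hash of the replica state (keys assumed sortable).
        return hashlib.sha256(
            repr(sorted(state.items())).encode()).hexdigest()

    def run_batch(state, batch, execute_parallel, verify):
        """verify(digest) returns True iff a quorum of replicas
        produced the same digest for this batch."""
        snapshot = dict(state)          # stand-in for copy-on-write
        execute_parallel(state, batch)  # optimistic: may diverge
        if not verify(state_digest(state)):
            # Divergence detected: restore the snapshot and re-execute
            # serially, which always converges but causes the
            # unpredictable latency spike mentioned above.
            state.clear()
            state.update(snapshot)
            for txn in batch:
                txn(state)              # each txn is callable on state
        return state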

Hope this helps,
Adam



>
>
> On Sun, Apr 17, 2016 at 6:53 PM, Adam Midvidy <[email protected]> wrote:
>
>> Hi David,
>>
>> Unfortunately, the problem you touched on here really hasn't been
>> "solved" in the context of ACID relational databases. Oracle offers
>> multimaster replication, but from their documentation
>> <https://docs.oracle.com/cd/B28359_01/server.111/b28326/repconflicts.htm#i26513>
>> it appears to provide AP guarantees when it sounds like your requirements
>> dictate a CP system.
>>
>> For CP systems, the most popular approach to write scalability, used in
>> e.g. Megastore
>> <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf>,
>> is to partition the data into shards ("Entity Groups" in the Megastore
>> paper) each having an authoritative master. Rather than trying to apply
>> non-conflicting replication operations in parallel as you mentioned, these
>> systems maintain multiple disjoint (serialized) commit logs, which has the
>> same outcome since entries can be appended and applied from different
>> commit logs without coordination.
>>
>> The data is partitioned by choosing an application specific 'shard key'
>> and assigning shards non-overlapping ranges of the data.  Perhaps for
>> Expensify, a plausible 'shard key' would be organization, as transactions
>> are probably confined as such.
>>
>> Each shard maintains its own Paxos leader that is responsible for
>> serializing commits to that shard. To increase CPU utilization, you can
>> have each server actually run a set of 'virtual nodes', so that each CPU
>> is used to process write transactions for a different subset of the data.
>>
>> A major challenge with this approach is rebalancing shards in response to
>> uneven data distribution, but this can be somewhat ameliorated by hashing
>> the shard key before partitioning. However, hashing the shard key does come
>> at the cost of making cross-shard range queries more costly.
>>
>> Adam
>>
>>
>>
>>
>>
>> On Sat, Apr 16, 2016 at 10:43 PM, David Barrett <[email protected]>
>> wrote:
>>
>>> I have an auto-reconfiguring distributed transaction layer atop sqlite
>>> that uses a paxos-style distributed consensus algorithm to elect a master.
>>> It works great and powers all of Expensify.  Contrary to popular belief,
>>> sqlite is *freaking amazing* for this, and is blazing fast -- including for
>>> multi-threaded concurrent reads.
>>>
>>> However, a limitation of our replication layer is that it serialises all
>>> commits, to ensure that all nodes commit all transactions in the same order
>>> -- a brute force way to ensure they all stay in sync.  But this means we
>>> only have a single CPU committing transactions on each server (though all
>>> are available for reads).
>>>
>>> This works a lot better than you'd expect, and has scaled to millions of
>>> users without trouble.  But it won't scale forever, so we're working on
>>> adding concurrent writes as well as reads.
>>>
>>> Luckily sqlite actually has the ability to do page-level locking, and
>>> thus support concurrent writes just fine.  But the replication layer itself
>>> is still serialized, sending transactions to slaves in the exact order they
>>> should be committed.
>>>
>>> My plan to overcome this is to detect groups of non-conflicting write
>>> queries, and flag them in such a fashion that slaves can commit them in
>>> parallel in any order.
>>>
>>> But -- and here's my question -- I'd love to learn more about how other
>>> systems have solved this same problem, especially other ACID relational
>>> databases.
>>>
>>> Thanks!
>>>
>>> David
>>>
>>
>
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
