On Mon, Apr 18, 2016 at 1:59 AM, David Barrett <[email protected]> wrote:
> Thanks for this response! However, for this line:
>
>> Oracle offers multimaster replication, but from their documentation it
>> appears to provide AP guarantees when it sounds like your requirements
>> dictate a CP system.
>
> Can you clarify what AP versus CP means?

The CAP theorem <https://en.wikipedia.org/wiki/CAP_theorem> shows that, in
the presence of network partitions, a distributed database can provide
either availability or consistency, but not both. CP systems provide
consistency; that is, when properly implemented, they maintain linearizable
<https://en.wikipedia.org/wiki/Linearizability> read/write semantics from
the POV of an external client. These semantics come at the cost of
rejecting client operations during certain network partitions. Conversely,
any node in an AP system will always accept a write or service a read, but
clients can read conflicting values from different nodes.

From your prior description, it seems that your system is designed to
provide CP semantics.

> As for sharding, unfortunately that doesn't really work for us. That
> would dramatically simplify things, but Expensify is more like a social
> network than a typical enterprise application: anybody can submit expenses
> to anybody else. This means we have a single consolidated datastore for
> all users. That said, I'm not looking to build a full multi-master
> environment (yet). I'm just looking to do multi-threaded writes
> orchestrated by a single master.

Having used Expensify at a prior job (great work by the way!), it seemed to
me like the data was split by organization, i.e. users from Company A do
not file expense reports to be approved by someone from Company B, etc. As
such, I suggested that you could partition the data by organization.
However, I understand I have a naive perspective of your application's
architecture, so I expect I could be wrong about this.

> How does Oracle or MySQL do this when distributed transactions are used?
> They clearly allow multi-threaded writes on the master (ie, it's fine to
> have two write transactions so long as they don't conflict). So I'm
> *guessing* they do basically the same thing I'm considering: allow slaves
> to commit non-conflicting transactions in any order, perhaps by using a
> "batchID" that increments every time someone commits a new transaction that
> overwrites the data read/written by a previous transaction.

MySQL secondaries apply replication updates to different databases in
parallel
<http://d2-systems.blogspot.com/2011/04/mysql-56x-feature-preview-multi.html>.
This is essentially a form of sharding, as updates within a single database
are still applied serially. Information on Oracle is a bit harder to come
by, because they are closed source and they have multiple differing
replication offerings. Oracle RAC
<https://docs.oracle.com/cd/B28359_01/rac.111/b28254/admcon.htm> relies on
each replica being backed by a shared SAN array, which pushes
synchronization concerns down to the storage layer. Oracle also offers some
sort of CP multi-master replication where "the database administrator is
responsible for resolving the data conflict manually"
<https://docs.oracle.com/cd/B28359_01/server.111/b28326/repmaster.htm#i33607>.
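For what it's worth, here is a minimal sketch (in Python) of that
per-database apply scheme -- a toy model, not MySQL's actual
implementation, and all names here are made up. Events for the same
database hash to the same queue and so apply serially in log order, while
events for different databases can proceed in parallel:

    import queue
    import threading

    class ParallelApplier:
        """Toy model of per-database parallel replication apply:
        updates to different databases run concurrently, but updates
        within one database stay serialized in commit-log order."""

        def __init__(self, num_workers=4):
            self.queues = [queue.Queue() for _ in range(num_workers)]
            for q in self.queues:
                threading.Thread(target=self._worker, args=(q,),
                                 daemon=True).start()

        def submit(self, db_name, apply_fn):
            # Same database -> same queue -> serial apply; different
            # databases may land on different queues and run in parallel.
            self.queues[hash(db_name) % len(self.queues)].put(apply_fn)

        def _worker(self, q):
            while True:
                q.get()()  # run the next apply callback for this stripe

Note that there is no conflict detection anywhere in this scheme; the
parallelism falls out of the partitioning, which is why I call it a form
of sharding.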
> But I'm curious if anybody knows this for real, as I'm just guessing.
> Thanks!
>
> -david

I can't think of any CP systems used in industry that use your idea of
batching non-conflicting replication updates and allowing replicas to apply
them in parallel. Some AP systems use similar techniques, e.g. optimistic
replication
<https://pages.lip6.fr/Marc.Shapiro/papers/Optimistic_Replication_Computing_Surveys_2005-03_cameraready.pdf>.
The reason this approach isn't popular is that efficiently detecting
non-conflicting transactions is not a trivial task by any means, and when
the workload is composed of non-conflicting transactions, partitioning
(sharding) provides the same performance benefits with less complexity.

There has been work done in this area in academia. In this paper
<http://arxiv.org/pdf/1311.6183.pdf>, the authors present P-SMR, a
technique "to deliver commands across replicas in total order but allow
them to execute concurrently when possible". If you read their evaluation
section, however, they get (by far) the best scalability improvements when
the workload is comprised of completely independent transactions, which is
again no better than sharding. Another system from academia worth looking
at is Eve <http://web.cse.ohio-state.edu/~yangwang/osdi12-final190.pdf>,
which improves on the throughput of P-SMR by optimistically executing
transactions in parallel, then comparing hashes of the result and rolling
back to a consistent state if necessary. While Eve achieves impressively
high throughput, it does so at the cost of higher system complexity and
unpredictable latency as a result of probabilistic rollbacks.

Hope this helps,
Adam

> On Sun, Apr 17, 2016 at 6:53 PM, Adam Midvidy <[email protected]> wrote:
>
>> Hi David,
>>
>> Unfortunately, the problem you touched on here really hasn't been
>> "solved" in the context of ACID relational databases. Oracle offers
>> multimaster replication, but from their documentation
>> <https://docs.oracle.com/cd/B28359_01/server.111/b28326/repconflicts.htm#i26513>
>> it appears to provide AP guarantees when it sounds like your requirements
>> dictate a CP system.
>>
>> For CP systems, the most popular approach to write scalability, used in
>> e.g. Megastore
>> <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf>,
>> is to partition the data into shards ("Entity Groups" in the Megastore
>> paper), each having an authoritative master. Rather than trying to apply
>> non-conflicting replication operations in parallel as you mentioned, these
>> systems maintain multiple disjoint (serialized) commit logs, which has the
>> same outcome, since entries can be appended and applied from different
>> commit logs without coordination.
>>
>> The data is partitioned by choosing an application-specific 'shard key'
>> and assigning shards non-overlapping ranges of the data. Perhaps for
>> Expensify, a plausible 'shard key' would be organization, as transactions
>> are probably confined as such.
>>
>> Each shard maintains its own Paxos leader that is responsible for
>> serializing commits to that shard. To increase CPU utilization, you can
>> have each server run a set of 'virtual nodes', so that each CPU is used
>> to process write transactions for a different subset of the data.
>>
>> A major challenge with this approach is rebalancing shards in response to
>> uneven data distribution, but this can be somewhat ameliorated by hashing
>> the shard key before partitioning. However, hashing the shard key does come
>> at the cost of making cross-shard range queries more costly.
>>
>> Adam
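(To make the hashed-shard-key / virtual-node idea above concrete, here is
a rough sketch in Python; the constants and names are hypothetical, not
taken from Megastore or any real system.)

    import hashlib

    NUM_VNODES = 64                  # hypothetical virtual-node count
    SERVERS = ["db1", "db2", "db3"]  # hypothetical physical servers

    def vnode_for(shard_key):
        # Hash the shard key (e.g. an organization ID) so data spreads
        # evenly across virtual nodes -- the price is that range scans
        # over the shard key now touch every virtual node.
        digest = hashlib.sha1(shard_key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % NUM_VNODES

    def server_for(shard_key):
        # Each virtual node owns its own serialized commit log and
        # Paxos leader; several virtual nodes per server means each
        # CPU serializes writes for a different subset of the data.
        return SERVERS[vnode_for(shard_key) % len(SERVERS)]

    print(server_for("acme-corp"))  # route this org's writes to one leader

Rebalancing then amounts to reassigning virtual nodes to servers; no keys
need to be re-hashed.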
>> On Sat, Apr 16, 2016 at 10:43 PM, David Barrett <[email protected]>
>> wrote:
>>
>>> I have an auto-reconfiguring distributed transaction layer atop sqlite
>>> that uses a Paxos-style distributed consensus algorithm to elect a master.
>>> It works great and powers all of Expensify. Contrary to popular belief,
>>> sqlite is *freaking amazing* for this, and is blazing fast -- including for
>>> multi-threaded concurrent reads.
>>>
>>> However, a limitation of our replication layer is that it serialises all
>>> commits, to ensure that all nodes commit all transactions in the same order
>>> -- a brute force way to ensure they all stay in sync. But this means we
>>> only have a single CPU committing transactions on each server (though all
>>> are available for reads).
>>>
>>> This works a lot better than you'd expect, and has scaled to millions of
>>> users without trouble. But it won't scale forever, so we're working on
>>> adding concurrent writes as well as reads.
>>>
>>> Luckily sqlite actually has the ability to do page-level locking, and
>>> can thus support concurrent writes just fine. But the replication layer
>>> itself is still serialized, sending transactions to slaves in the exact
>>> order they should be committed.
>>>
>>> My plan to overcome this is to detect groups of non-conflicting write
>>> queries, and flag them in such a fashion that slaves can commit them in
>>> parallel in any order.
>>>
>>> But -- and here's my question -- I'd love to learn more about how other
>>> systems have solved this same problem, especially other ACID relational
>>> databases.
>>>
>>> Thanks!
>>>
>>> David
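For the plan described above -- flagging groups of non-conflicting writes
so slaves can commit them in any order -- one illustrative approach (my
sketch, not Expensify's actual design) is to track each transaction's page
read/write sets and batch transactions whose sets are disjoint:

    from dataclasses import dataclass, field

    @dataclass
    class Txn:
        txn_id: int
        read_pages: set = field(default_factory=set)
        write_pages: set = field(default_factory=set)

    def conflicts(a, b):
        # Two transactions conflict if either writes a page the other
        # reads or writes (write-write, read-write, write-read).
        return bool(a.write_pages & (b.read_pages | b.write_pages)
                    or b.write_pages & a.read_pages)

    def batch(txns):
        # Assign each transaction to the first batch that commits after
        # every batch it conflicts with, preserving the master's serial
        # order for conflicting transactions.
        batches = []
        for t in txns:
            last_conflict = -1
            for i, b in enumerate(batches):
                if any(conflicts(t, other) for other in b):
                    last_conflict = i
            if last_conflict + 1 < len(batches):
                batches[last_conflict + 1].append(t)
            else:
                batches.append([t])
        return batches

Members of one batch are mutually non-conflicting, so a slave may commit
them on any thread in any order; only the batch boundaries themselves stay
serialized.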
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
