Thanks for this response! However, for this line:

> Oracle offers multimaster replication, but from their documentation it
> appears to provide AP guarantees when it sounds like your requirements
> dictate a CP system.
Can you clarify what AP versus CP means?

As for sharding, unfortunately that doesn't really work for us. That would
dramatically simplify things, but Expensify is more like a social network
than a typical enterprise application: anybody can submit expenses to
anybody else. This means we have a single consolidated datastore for all
users.

That said, I'm not looking to build a full multi-master environment (yet).
I'm just looking to do multi-threaded writes orchestrated by a single
master. How does Oracle or MySQL do this when distributed transactions are
used? They clearly allow multi-threaded writes on the master (i.e., it's
fine to have two write transactions so long as they don't conflict). So I'm
*guessing* they do basically the same thing I'm considering: allow slaves
to commit non-conflicting transactions in any order, perhaps by using a
"batchID" that increments every time someone commits a new transaction that
overwrites the data read/written by a previous transaction. But I'm curious
if anybody knows this for real, as I'm just guessing.

Thanks!

-david

On Sun, Apr 17, 2016 at 6:53 PM, Adam Midvidy <[email protected]> wrote:

> Hi David,
>
> Unfortunately, the problem you touched on here really hasn't been "solved"
> in the context of ACID relational databases. Oracle offers multimaster
> replication, but from their documentation
> <https://docs.oracle.com/cd/B28359_01/server.111/b28326/repconflicts.htm#i26513>
> it appears to provide AP guarantees when it sounds like your requirements
> dictate a CP system.
>
> For CP systems, the most popular approach to write scalability, used in
> e.g. Megastore
> <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf>,
> is to partition the data into shards ("Entity Groups" in the Megastore
> paper), each having an authoritative master.
> Rather than trying to apply non-conflicting replication operations in
> parallel as you mentioned, these systems maintain multiple disjoint
> (serialized) commit logs, which has the same outcome, since entries can
> be appended and applied from different commit logs without coordination.
>
> The data is partitioned by choosing an application-specific 'shard key'
> and assigning shards non-overlapping ranges of the data. Perhaps for
> Expensify, a plausible 'shard key' would be organization, as transactions
> are probably confined as such.
>
> Each shard maintains its own Paxos leader that is responsible for
> serializing commits to that shard. To increase CPU utilization, you can
> have each server actually run a set of 'virtual nodes', so that each CPU
> is used to process write transactions for a different subset of the data.
>
> A major challenge with this approach is rebalancing shards in response to
> uneven data distribution, but this can be somewhat ameliorated by hashing
> the shard key before partitioning. However, hashing the shard key does
> come at the cost of making cross-shard range queries more costly.
>
> Adam
>
> On Sat, Apr 16, 2016 at 10:43 PM, David Barrett <[email protected]>
> wrote:
>
>> I have an auto-reconfiguring distributed transaction layer atop sqlite
>> that uses a Paxos-style distributed consensus algorithm to elect a
>> master. It works great and powers all of Expensify. Contrary to popular
>> belief, sqlite is *freaking amazing* for this, and is blazing fast --
>> including for multi-threaded concurrent reads.
>>
>> However, a limitation of our replication layer is that it serialises all
>> commits, to ensure that all nodes commit all transactions in the same
>> order -- a brute-force way to ensure they all stay in sync. But this
>> means we only have a single CPU committing transactions on each server
>> (though all are available for reads).
>>
>> This works a lot better than you'd expect, and has scaled to millions of
>> users without trouble. But it won't scale forever, so we're working on
>> adding concurrent writes as well as reads.
>>
>> Luckily sqlite actually has the ability to do page-level locking, and
>> thus supports concurrent writes just fine. But the replication layer
>> itself is still serialized, sending transactions to slaves in the exact
>> order they should be committed.
>>
>> My plan to overcome this is to detect groups of non-conflicting write
>> queries, and flag them in such a fashion that slaves can commit them in
>> parallel, in any order.
>>
>> But -- and here's my question -- I'd love to learn more about how other
>> systems have solved this same problem, especially other ACID relational
>> databases.
>>
>> Thanks!
>>
>> David
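P.S. To make the "batchID" idea above a little more concrete, here is a
rough sketch of the scheme I'm imagining. All of the names are made up, and
this is only my guess at how parallel apply could work, not how Oracle or
MySQL actually implement it. The idea: each transaction carries the set of
pages it read or wrote, and its batchID is one more than the highest
batchID of any earlier transaction touching an overlapping page set.
Transactions sharing a batchID are guaranteed disjoint, so a slave may
commit them in any order (or in parallel), as long as batches themselves
are applied in ascending order.

```python
def assign_batch_ids(transactions):
    """Sketch only: assign a batchID to each transaction in commit order.

    `transactions` is a list of page sets, one per transaction, in the
    order the master committed them. Returns a parallel list of batchIDs.
    Two transactions with the same batchID never touch the same page, so
    slaves can commit each batch's transactions concurrently, batch by
    batch.
    """
    latest_batch_for_page = {}  # page -> batchID of last txn touching it
    batch_ids = []
    for pages in transactions:
        # Conflict with any earlier txn forces us into a later batch.
        batch = 1 + max(
            (latest_batch_for_page.get(p, 0) for p in pages), default=0
        )
        for p in pages:
            latest_batch_for_page[p] = batch
        batch_ids.append(batch)
    return batch_ids
```

For example, four transactions touching pages {1}, {2}, {1, 2}, {1} would
get batchIDs 1, 1, 2, 3: the first two don't conflict and can commit in
either order, while the third overlaps both and must wait for batch 1.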
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
