Hi David,

Unfortunately, the problem you touched on here really hasn't been "solved" in the context of ACID relational databases. Oracle offers multimaster replication, but from their documentation <https://docs.oracle.com/cd/B28359_01/server.111/b28326/repconflicts.htm#i26513> it appears to provide AP guarantees, while it sounds like your requirements dictate a CP system.
For CP systems, the most popular approach to write scalability, used in e.g. Megastore <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf>, is to partition the data into shards ("Entity Groups" in the Megastore paper), each with an authoritative master. Rather than trying to apply non-conflicting replication operations in parallel as you mentioned, these systems maintain multiple disjoint (serialized) commit logs, which has the same outcome since entries can be appended and applied from different commit logs without coordination.

The data is partitioned by choosing an application-specific 'shard key' and assigning each shard a non-overlapping range of the data. For Expensify, a plausible 'shard key' might be the organization, since transactions are probably confined to a single organization. Each shard maintains its own Paxos leader, which is responsible for serializing commits to that shard. To increase CPU utilization, you can have each server run a set of 'virtual nodes', so that each CPU processes write transactions for a different subset of the data.

A major challenge with this approach is rebalancing shards in response to uneven data distribution. This can be somewhat ameliorated by hashing the shard key before partitioning, though that makes cross-shard range queries more expensive.

Adam

On Sat, Apr 16, 2016 at 10:43 PM, David Barrett <[email protected]> wrote:
> I have an auto-reconfiguring distributed transaction layer atop sqlite
> that uses a paxos-style distributed consensus algorithm to elect a
> master. It works great and powers all of Expensify. Contrary to popular
> belief, sqlite is *freaking amazing* for this, and is blazing fast --
> including for multi-threaded concurrent reads.
> However, a limitation of our replication layer is that it serialises all
> commits, to ensure that all nodes commit all transactions in the same
> order -- a brute-force way to ensure they all stay in sync. But this
> means we only have a single CPU committing transactions on each server
> (though all are available for reads).
>
> This works a lot better than you'd expect, and has scaled to millions of
> users without trouble. But it won't scale forever, so we're working on
> adding concurrent writes as well as reads.
>
> Luckily sqlite actually has the ability to do page-level locking, and
> thus supports concurrent writes just fine. But the replication layer
> itself is still serialized, sending transactions to slaves in the exact
> order they should be committed.
>
> My plan to overcome this is to detect groups of non-conflicting write
> queries, and flag them in such a fashion that slaves can commit them in
> parallel in any order.
>
> But -- and here's my question -- I'd love to learn more about how other
> systems have solved this same problem, especially other ACID relational
> databases.
>
> Thanks!
>
> David
>
> _______________________________________________
> p2p-hackers mailing list
> [email protected]
> http://lists.zooko.com/mailman/listinfo/p2p-hackers
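P.S. To make the shard-key idea above concrete, here is a minimal sketch of hashed shard routing with one disjoint, serialized commit log per shard. All names and the fixed shard count are hypothetical, and a single appended list stands in for a real Paxos-replicated log:

```python
import hashlib

NUM_SHARDS = 64  # hypothetical fixed shard count

def shard_for(shard_key: str) -> int:
    """Map an application shard key (e.g. an organization ID) to a shard.

    Hashing first spreads hot key ranges evenly across shards, at the
    cost of making cross-shard range queries scatter to every shard.
    """
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

class Shard:
    """One 'virtual node': an independent, serialized commit log.

    Because the logs for different shards are disjoint, entries can be
    appended and applied in parallel without cross-shard coordination.
    """

    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.log: list[str] = []  # serialized commit log for this shard only

    def commit(self, txn: str) -> int:
        # In a real system this append would be a Paxos round driven by
        # this shard's elected leader.
        self.log.append(txn)
        return len(self.log) - 1  # position in this shard's log

shards = [Shard(i) for i in range(NUM_SHARDS)]

def commit_txn(shard_key: str, txn: str) -> tuple[int, int]:
    """Route a transaction to its shard and return (shard id, log position)."""
    s = shards[shard_for(shard_key)]
    return (s.shard_id, s.commit(txn))
```

Transactions for different shard keys may land on different shards, whose logs advance independently; transactions for the same key always hit the same serialized log.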
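And for the plan you describe -- detecting groups of non-conflicting writes that slaves can commit in any order -- a rough replica-side sketch, assuming each transaction's write set is already known as a set of page numbers (the greedy batching and all names are illustrative, not your actual design):

```python
def group_nonconflicting(txns: list[tuple[str, set[int]]]) -> list[list[str]]:
    """Greedily batch transactions whose write sets (here: page numbers)
    are pairwise disjoint.

    Transactions within a batch can be committed by replicas in parallel,
    in any order; the batches themselves are applied in sequence.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    pages_in_group: set[int] = set()
    for txn_id, pages in txns:
        if pages & pages_in_group:
            # Conflicts with the current batch: seal it and start a new one.
            groups.append(current)
            current, pages_in_group = [], set()
        current.append(txn_id)
        pages_in_group |= pages
    if current:
        groups.append(current)
    return groups
```

For example, with write sets t1:{1,2}, t2:{3}, t3:{2,4}, t4:{5}, this yields the batches [t1, t2] and [t3, t4]: t3 touches page 2, so it must wait for the first batch, but t4 is free to ride along with it.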
