Hi David,

Unfortunately, the problem you touched on here really hasn't been "solved"
in the context of ACID relational databases. Oracle offers multimaster
replication, but from their documentation
<https://docs.oracle.com/cd/B28359_01/server.111/b28326/repconflicts.htm#i26513>
it appears to provide AP guarantees (conflicting writes are accepted and
reconciled after the fact), while it sounds like your requirements dictate
a CP system.

For CP systems, the most popular approach to write scalability, used in
e.g. Megastore
<http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf>,
is to partition the data into shards ("Entity Groups" in the Megastore
paper) each having an authoritative master. Rather than trying to apply
non-conflicting replication operations in parallel as you mentioned, these
systems maintain multiple disjoint (serialized) commit logs, which has the
same outcome since entries can be appended and applied from different
commit logs without coordination.
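To make that concrete, here's a minimal Python sketch of the idea, with
hypothetical names (ShardedReplica, apply_ready, the "org-1"/"org-2" shard
ids) invented for illustration: each shard's log is serialized internally,
but different shards' logs can be appended to and applied in any relative
order.

```python
from collections import defaultdict

class ShardedReplica:
    def __init__(self):
        self.logs = defaultdict(list)     # shard_id -> serialized log entries
        self.applied = defaultdict(int)   # shard_id -> count of applied entries
        self.state = defaultdict(dict)    # shard_id -> key/value state

    def append(self, shard_id, entry):
        # The authoritative master for shard_id serializes appends to its log.
        self.logs[shard_id].append(entry)

    def apply_ready(self, shard_id):
        # Apply unapplied entries for one shard, in log order. Different
        # shards can be applied in any order, without coordination.
        log = self.logs[shard_id]
        while self.applied[shard_id] < len(log):
            key, value = log[self.applied[shard_id]]
            self.state[shard_id][key] = value
            self.applied[shard_id] += 1

replica = ShardedReplica()
replica.append("org-1", ("balance", 100))
replica.append("org-2", ("balance", 250))
replica.apply_ready("org-2")   # applying org-2 before org-1 is safe:
replica.apply_ready("org-1")   # the two logs are disjoint
```

The outcome is the same as detecting non-conflicting operations at
replication time, but the conflict analysis is done once, up front, by the
choice of partitioning.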

The data is partitioned by choosing an application-specific 'shard key' and
assigning each shard a non-overlapping range of the key space. For
Expensify, a plausible shard key would be the organization, since
transactions are probably confined to a single organization.
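A sketch of range-based routing, assuming purely illustrative boundary
values and shard names:

```python
import bisect

# Upper bounds (exclusive) of each shard's key range, kept sorted.
boundaries = ["g", "n", "t"]
shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(org_key: str) -> str:
    # shard-0 owns keys below "g", shard-1 owns ["g", "n"), and so on;
    # bisect_right sends a key equal to a boundary to the next shard up.
    return shards[bisect.bisect_right(boundaries, org_key)]
```

With ranges (rather than hashing), keys that sort near each other stay on
the same shard, which keeps cross-organization range scans cheap.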

Each shard maintains its own Paxos leader that is responsible for
serializing commits to that shard. To increase CPU utilization, each
server can run a set of 'virtual nodes', so that each CPU is used to
process write transactions for a different subset of the data.
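The mapping itself can be as simple as the sketch below (the function name
and the fixed virtual-node count are assumptions for illustration): each
virtual node runs its own serialized commit loop on one core, so every CPU
handles writes for a disjoint subset of shards.

```python
VNODES_PER_SERVER = 4   # illustrative; in practice, roughly one per core

def vnode_for(shard_id: int, vnodes: int = VNODES_PER_SERVER) -> int:
    # Statically map each shard to one local virtual node (core). Within
    # a vnode commits stay serialized; across vnodes they run in parallel.
    return shard_id % vnodes
```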

A major challenge with this approach is rebalancing shards in response to
uneven data distribution. Hashing the shard key before partitioning
ameliorates this somewhat, but at the cost of making cross-shard range
queries more expensive.
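For completeness, a sketch of the hashed variant (the function name and
shard count are made up): a stable hash spreads hot, lexicographically
clustered keys evenly, but lexicographically adjacent keys then land on
unrelated shards, so a range scan must fan out to every shard.

```python
import hashlib

NUM_SHARDS = 8

def hashed_shard(org_key: str) -> int:
    # Use a stable cryptographic hash, not Python's built-in hash(),
    # so every server computes the same placement for a given key.
    digest = hashlib.sha1(org_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```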

Adam

On Sat, Apr 16, 2016 at 10:43 PM, David Barrett <[email protected]>
wrote:

> I have an auto-reconfiguring distributed transaction layer atop sqlite that
> uses a paxos-style distributed consensus algorithm to elect a master.  It
> works great and powers all of Expensify.  Contrary to popular belief,
> sqlite is *freaking amazing* for this, and is blazing fast -- including for
> multi-threaded concurrent reads.
>
> However, a limitation of our replication layer is that it serialises all
> commits, to ensure that all nodes commit all transactions in the same order
> -- a brute force way to ensure they all stay in sync.  But this means we
> only have a single CPU committing transactions on each server (though all
> are available for reads).
>
> This works a lot better than you'd expect, and has scaled to millions of
> users without trouble.  But it won't scale forever, so we're working on
> adding concurrent writes as well as reads.
>
> Luckily sqlite actually has the ability to do page-level locking, and thus
> support concurrent writes just fine.  But the replication layer itself is
> still serialized, sending transactions to slaves in the exact order they
> should be committed.
>
> My plan to overcome this is to detect groups of non-conflicting write
> queries, and flag them in such a fashion that slaves can commit them in
> parallel in any order.
>
> But -- and here's my question -- I'd love to learn more about how other
> systems have solved this same problem, especially other ACID relational
> databases.
>
> Thanks!
>
> David
>
> _______________________________________________
> p2p-hackers mailing list
> [email protected]
> http://lists.zooko.com/mailman/listinfo/p2p-hackers
>
>
