Thanks for this response!  However, for this line:

> Oracle offers multimaster replication, but from their documentation it
> appears to provide AP guarantees when it sounds like your requirements
> dictate a CP system.

Can you clarify what AP versus CP means?

As for sharding, unfortunately that doesn't really work for us.  That would
dramatically simplify things, but Expensify is more like a social network
than a typical enterprise application: anybody can submit expenses to
anybody else.  This means we have a single consolidated datastore for all
users.  That said, I'm not looking to build a full multi-master environment
(yet).  I'm just looking to do multi-threaded writes orchestrated by a
single master.

How does Oracle or MySQL do this when distributed transactions are used?
They clearly allow multi-threaded writes on the master (i.e., it's fine to
have two write transactions so long as they don't conflict).  So I'm
*guessing* they do basically the same thing I'm considering: allow slaves
to commit non-conflicting transactions in any order, perhaps by using a
"batchID" that increments every time someone commits a new transaction that
overwrites the data read/written by a previous transaction.
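
To make that guess concrete, here's roughly the bookkeeping I have in mind, in
rough C++ -- just a sketch with made-up names (ReplicatedTxn, tagWithBatch,
etc.), and not a claim about how Oracle or MySQL actually implement it:

#include <cstdint>
#include <set>
#include <string>

// A write transaction as shipped to slaves, tagged with the batch it belongs
// to.  Transactions sharing a batchID are guaranteed not to conflict.
struct ReplicatedTxn {
    uint64_t batchID;
    std::set<std::string> pagesTouched;  // pages (or keys) read or written
    std::string writeStatements;         // the SQL to replay on the slave
};

// Master side: tag each committed transaction with the current batchID.  If
// it touches anything already touched in the current batch, it conflicts, so
// close the batch and start a new one.
void tagWithBatch(ReplicatedTxn& txn,
                  std::set<std::string>& currentBatchPages,
                  uint64_t& currentBatchID) {
    for (const auto& page : txn.pagesTouched) {
        if (currentBatchPages.count(page)) {
            ++currentBatchID;            // conflict: start a fresh batch
            currentBatchPages.clear();
            break;
        }
    }
    currentBatchPages.insert(txn.pagesTouched.begin(), txn.pagesTouched.end());
    txn.batchID = currentBatchID;
}

The slaves' only ordering rule would then be: don't start anything from batch
N+1 until everything from batch N has committed, but within a batch, commit in
any order and on any number of threads.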

But I'm curious if anybody knows this for real, as I'm just guessing.
Thanks!

-david



On Sun, Apr 17, 2016 at 6:53 PM, Adam Midvidy <[email protected]> wrote:

> Hi David,
>
> Unfortunately, the problem you touched on here really hasn't been "solved"
> in the context of ACID relational databases. Oracle offers multimaster
> replication, but from their documentation
> <https://docs.oracle.com/cd/B28359_01/server.111/b28326/repconflicts.htm#i26513>
> it appears to provide AP guarantees when it sounds like your requirements
> dictate a CP system.
>
> For CP systems, the most popular approach to write scalability, used in
> e.g. Megastore
> <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36971.pdf>,
> is to partition the data into shards ("Entity Groups" in the Megastore
> paper) each having an authoritative master. Rather than trying to apply
> non-conflicting replication operations in parallel as you mentioned, these
> systems maintain multiple disjoint (serialized) commit logs, which has the
> same outcome since entries can be appended and applied from different
> commit logs without coordination.
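>
> To sketch the idea (all names here are purely illustrative, not from the
> Megastore paper): one strictly ordered log per shard, each drained by its
> own worker, with no coordination across logs.
>
> #include <cstdint>
> #include <map>
> #include <string>
> #include <vector>
>
> using ShardID = uint64_t;
>
> struct LogEntry {
>     uint64_t seq;    // position within this shard's log only
>     std::string op;  // the serialized write to replay
> };
>
> // One serialized commit log per shard ("entity group").  Writes to
> // different shards never conflict, so each shard's log can be appended to
> // and applied by its own thread.
> std::map<ShardID, std::vector<LogEntry>> commitLogs;
>
> void replayLocally(const LogEntry& entry) { (void)entry; /* apply entry.op */ }
>
> void applyShardLog(const std::vector<LogEntry>& log) {
>     for (const auto& entry : log)
>         replayLocally(entry);  // order matters only within this one shard
> }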
>
> The data is partitioned by choosing an application specific 'shard key'
> and assigning shards non-overlapping ranges of the data.  Perhaps for
> Expensify, a plausible 'shard key' would be the organization, since expense
> transactions are probably confined to a single organization.
>
> Each shard maintains its own Paxos leader that is responsible for
> serializing commits to that shard. To increase CPU utilization, you can
> have each server actually run a set of 'virtual nodes', so that each CPU
> is used to process write transactions for a different subset of the data.
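>
> Concretely (again just an illustration, with invented names), each virtual
> node can own its own queue of pending writes, drained by its own worker
> thread, so different cores commit writes for different subsets of the data:
>
> #include <cstdint>
> #include <deque>
> #include <string>
> #include <utility>
> #include <vector>
>
> // Each server hosts several virtual nodes, roughly one per CPU core.  A
> // virtual node owns a disjoint subset of the shards, so its worker can
> // serialize and commit writes for that subset without coordinating with
> // the workers for the other subsets.
> struct VirtualNode {
>     std::deque<std::string> pendingWrites;  // committed in order by one worker
> };
>
> // Route a write to the virtual node that owns its shard.
> void enqueueWrite(std::vector<VirtualNode>& vnodes,
>                   uint64_t shardID, std::string write) {
>     vnodes[shardID % vnodes.size()].pendingWrites.push_back(std::move(write));
> }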
>
> A major challenge with this approach is rebalancing shards in response to
> uneven data distribution, but this can be somewhat ameliorated by hashing
> the shard key before partitioning. However, hashing the shard key makes
> cross-shard range queries more expensive.
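>
> For example (illustrative only), hashed placement is just:
>
> #include <cstdint>
> #include <functional>
> #include <string>
>
> // Hashing spreads hot, contiguous shard keys evenly across shards...
> uint64_t shardFor(const std::string& shardKey, uint64_t numShards) {
>     return std::hash<std::string>{}(shardKey) % numShards;
> }
> // ...but logically adjacent keys now land on different shards, so a range
> // query over the shard key has to fan out to every shard.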
>
> Adam
>
>
>
>
>
> On Sat, Apr 16, 2016 at 10:43 PM, David Barrett <[email protected]>
> wrote:
>
>> I have an auto-reconfiguring distributed transaction layer atop sqlite
>> that uses a paxos-style distributed consensus algorithm to elect a master.
>> It works great and powers all of Expensify.  Contrary to popular belief,
>> sqlite is *freaking amazing* for this, and is blazing fast -- including for
>> multi-threaded concurrent reads.
>>
>> However, a limitation of our replication layer is that it serializes all
>> commits, to ensure that all nodes commit all transactions in the same order
>> -- a brute force way to ensure they all stay in sync.  But this means we
>> only have a single CPU committing transactions on each server (though all
>> are available for reads).
>>
>> This works a lot better than you'd expect, and has scaled to millions of
>> users without trouble.  But it won't scale forever, so we're working on
>> adding concurrent writes as well as reads.
>>
>> Luckily, sqlite actually has the ability to do page-level locking, and can
>> thus support concurrent writes just fine.  But the replication layer itself
>> is still serialized, sending transactions to slaves in the exact order they
>> should be committed.
>>
>> My plan to overcome this is to detect groups of non-conflicting write
>> queries, and flag them in such a fashion that slaves can commit them in
>> parallel in any order.
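>>
>> Roughly (just a sketch of the idea, with invented names), the master would
>> record the set of pages each write transaction touched and only flag two
>> transactions as safe to commit in parallel if those sets don't intersect:
>>
>> #include <cstdint>
>> #include <set>
>>
>> // Two write transactions may be flagged for parallel, order-free commit on
>> // the slaves only if the pages they read or wrote don't overlap.
>> bool canCommitInParallel(const std::set<uint64_t>& pagesA,
>>                          const std::set<uint64_t>& pagesB) {
>>     for (uint64_t page : pagesA) {
>>         if (pagesB.count(page)) return false;  // conflict: keep these ordered
>>     }
>>     return true;
>> }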
>>
>> But -- and here's my question -- I'd love to learn more about how other
>> systems have solved this same problem, especially other ACID relational
>> databases.
>>
>> Thanks!
>>
>> David
>>
>
_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
