Hi,

On 09/01/2015 09:19 PM, Josh Berkus wrote:
> On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>>> We want multiple copies of shards created by the sharding system
>>> itself. Having a separate, and completely orthogonal, redundancy
>>> system to the sharding system is overly burdensome on the DBA and
>>> makes low-data-loss HA impossible.

>> IMHO it'd be quite unfortunate if the design would make it
>> impossible to combine those two features (e.g. creating standbys
>> for shards and failing over to them).
>>
>> It's true that solving HA at the sharding level (by keeping
>> multiple copies of each shard) may be simpler than combining
>> sharding and standbys, but I don't see why it makes low-data-loss
>> HA impossible.

> Other way around, that is, having replication standbys as the only
> method of redundancy requires either high data loss or high latency
> for all writes.

I haven't said that. I said that we should allow that topology, not that it should be the only method of redundancy.


> In the case of async rep, every time we fail over a node, the entire
> cluster would need to roll back to the last common known-good replay
> point, hence high data loss.
>
> In the case of sync rep, we are required to wait for at least double
> network lag time in order to do a single write ... making
> write-scalability quite difficult.

That assumes the latency (or rather the increase due to sync rep) is a problem for the use case. That may be true for many workloads, but it's certainly not a problem for many BI/DWH use cases performing mostly large batch loads. In those cases, network bandwidth may be the more important resource.

For example, assume there are just two shards in two separate data centers, connected by a link with limited bandwidth, and that you always keep a local replica of each shard for failover. So you have A1+A2 in DC1 and B1+B2 in DC2. If you're in DC1, writing data to B1 means you also have to write it to B2 and wait for it. So you either send the data to each node separately (consuming 2x the bandwidth), or send it to B1 and let it propagate to B2, e.g. through sync rep.
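
To make the second option concrete, here's a minimal sketch of what the B1/B2 sync rep wiring might look like in a 9.4/9.5-era setup (the node names, application_name and host are hypothetical, and this is just one way to set it up):

    # on B1 (the shard primary in DC2), postgresql.conf
    synchronous_standby_names = 'b2'   # commit waits for the local standby
    synchronous_commit = on

    # on B2 (the local standby in DC2), recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=b1.dc2.example.com application_name=b2'

With that, a write from DC1 crosses the inter-DC link only once (to B1), and the extra wait on commit is the intra-DC round trip to B2.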

So while you may be right about single-DC deployments, the situation is quite different with multi-DC deployments - not only is the network bandwidth between the locations limited, but latencies within a DC may be a tiny fraction of latencies between the locations (say, a fraction of a millisecond vs. tens of milliseconds), to the extent that the increase due to sync rep may be just noise. So the local replication may actually be way faster.

I can imagine forwarding the data from B1 to B2 even with a purely sharding-based solution, but at that point you have effectively re-implemented sync rep.

IMHO the design has to address multi-DC setups somehow. I think many of the customers who are so concerned about scaling to many shards are also concerned about availability in case of DC outages, no?

We should also consider support for custom topologies (not just a full mesh, or whatever we choose as the default/initial topology), which is somewhat related.


> Further, if using replication the sharding system would have no way
> to (a) find out immediately if a copy was bad and (b) fail over
> quickly to a copy of the shard if the first requested copy was not
> responding. With async replication, we also can't use multiple copies
> of the same shard as a way to balance read workloads.

I don't follow. With sync rep we do know whether the copy is OK, because the node either confirms the writes or it doesn't. Failover certainly is more complicated and not as immediate as with an extra copy kept at the sharding level, but that's a question of trade-offs.
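
For example, the sharding layer (or a monitoring script) can check from the shard primary whether the standby is attached and confirming writes via pg_stat_replication - a rough sketch, assuming the standby connects with application_name=b2 as above:

    -- run on B1; an attached, confirming standby shows up
    -- here with sync_state = 'sync'
    SELECT application_name, state, sync_state
      FROM pg_stat_replication
     WHERE application_name = 'b2';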

It's true we don't have an auto-failover solution at the moment, but as I said - I can easily imagine most people using just sharding, while some deployments use sync rep with manual failover.
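
Just to illustrate what I mean by manual failover - if B1 dies, the DBA promotes the local standby and repoints the sharding layer at it (a sketch only; the data directory path is hypothetical):

    # on B2, once B1 is considered dead
    pg_ctl promote -D /path/to/b2/data

    # then update the shard map so writes for shard B go to B2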


> If we write to multiple copies as a part of the sharding feature,
> then that can be parallelized, so that we are waiting only as long as
> the slowest write (or in failure cases, as long as the shard
> timeout). Further, we can check for shard-copy health and update
> shard availability data with each user request, so that the ability
> to see stale/bad data is minimized.

Again, this assumes infinite network bandwidth.


> There are obvious problems with multiplexing writes, which you can
> figure out if you knock pg_shard around a bit. But I really think
> that solving those problems is the only way to go.
>
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of
> the sharding system if you're aiming for write scalability.

I haven't mentioned BDR at all, and given its async nature I don't have a clear idea of how it fits into the sharding world at this point.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

