Hi,

On 09/01/2015 09:19 PM, Josh Berkus wrote:
> On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>>> We want multiple copies of shards created by the sharding system
>>> itself. Having a separate, and completely orthogonal, redundancy
>>> system to the sharding system is overly burdensome on the DBA and
>>> makes low-data-loss HA impossible.

>> IMHO it'd be quite unfortunate if the design would make it
>> impossible to combine those two features (e.g. creating standbys
>> for shards and failing over to them).
>>
>> It's true that solving HA at the sharding level (by keeping
>> multiple copies of each shard) may be simpler than combining
>> sharding and standbys, but I don't see why it makes low-data-loss
>> HA impossible.

> Other way around, that is, having replication standbys as the only
> method of redundancy requires either high data loss or high latency
> for all writes.

I haven't said that. I said that we should allow that topology, not that it should be the only method of redundancy.


> In the case of async rep, every time we fail over a node, the entire
> cluster would need to roll back to the last common known-good replay
> point, hence high data loss.
>
> In the case of sync rep, we are required to wait for at least double
> network lag time in order to do a single write ... making
> write-scalability quite difficult.

That assumes the latency (or rather the increase due to sync rep) is a problem for the use case. That may be true for many workloads, but it's certainly not a problem for many BI/DWH use cases performing mostly large batch loads. In those cases, network bandwidth may be the more important resource.

For example, assume there are just two shards in two separate data centers, connected by a link with limited bandwidth, and that you always keep a local replica of each shard for failover. So you have A1+A2 in DC1 and B1+B2 in DC2. If you're in DC1, writing data to B1 means you also have to write it to B2 and wait for it. So you either send the data to each node separately (consuming 2x the bandwidth), or send it to B1 and let it propagate to B2, e.g. through sync rep.
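
To make the second option concrete, here's a minimal sketch of what the B1/B2 sync rep wiring might look like in a 9.4/9.5-era setup (the node names, application_name and host are hypothetical, and this is just one way to set it up):

    # on B1 (the shard primary in DC2), postgresql.conf
    synchronous_standby_names = 'b2'   # commit waits for the local standby
    synchronous_commit = on

    # on B2 (the local standby in DC2), recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=b1.dc2.example.com application_name=b2'

With that, a write from DC1 crosses the inter-DC link only once (to B1), and the extra wait on commit is the intra-DC round trip to B2.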

So while you may be right about single-DC deployments, the situation is quite different with multi-DC deployments - not only is the network bandwidth between the locations limited, but latencies within a DC may be a tiny fraction of latencies between the locations (say, a fraction of a millisecond vs. tens of milliseconds), to the extent that the increase due to sync rep may be just noise. So the local replication may actually be way faster.

I can imagine forwarding the data from B1 to B2 even with a purely sharding-based solution, but at that point you have effectively re-implemented sync rep.

IMHO the design has to address multi-DC setups somehow. I think many of the customers who are so concerned about scaling to many shards are also concerned about availability in case of DC outages, no?

We should also consider support for custom topologies (not just a full mesh, or whatever we choose as the default/initial topology), which is somewhat related.


> Further, if using replication the sharding system would have no way
> to (a) find out immediately if a copy was bad and (b) fail over
> quickly to a copy of the shard if the first requested copy was not
> responding. With async replication, we also can't use multiple copies
> of the same shard as a way to balance read workloads.

I don't follow. With sync rep we do know whether the copy is OK, because the node either confirms the writes or it doesn't. Failover certainly is more complicated and not as immediate as with an extra copy kept at the sharding level, but that's a question of trade-offs.
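
For example, the sharding layer (or a monitoring script) can check from the shard primary whether the standby is attached and confirming writes via pg_stat_replication - a rough sketch, assuming the standby connects with application_name=b2 as above:

    -- run on B1; an attached, confirming standby shows up
    -- here with sync_state = 'sync'
    SELECT application_name, state, sync_state
      FROM pg_stat_replication
     WHERE application_name = 'b2';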

It's true we don't have an auto-failover solution at the moment, but as I said - I can easily imagine most people using just sharding, while some deployments use sync rep with manual failover.
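
Just to illustrate what I mean by manual failover - if B1 dies, the DBA promotes the local standby and repoints the sharding layer at it (a sketch only; the data directory path is hypothetical):

    # on B2, once B1 is considered dead
    pg_ctl promote -D /path/to/b2/data

    # then update the shard map so writes for shard B go to B2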


> If we write to multiple copies as a part of the sharding feature,
> then that can be parallelized, so that we are waiting only as long as
> the slowest write (or in failure cases, as long as the shard
> timeout). Further, we can check for shard-copy health and update
> shard availability data with each user request, so that the ability
> to see stale/bad data is minimized.

Again, this assumes infinite network bandwidth.


> There are obvious problems with multiplexing writes, which you can
> figure out if you knock pg_shard around a bit. But I really think
> that solving those problems is the only way to go.
>
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of
> the sharding system if you're aiming for write scalability.

I haven't mentioned BDR at all, and given its async nature I don't have a clear idea of how it fits into the sharding world at this point.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

