Re: [HACKERS] Horizontal scalability/sharding

2015-09-08 Thread Merlin Moncure
On Mon, Sep 7, 2015 at 1:33 PM, Ahsan Hadi  wrote:
> I
>
> On Monday, September 7, 2015, Ashutosh Bapat
>  wrote:
>>
>>
>>
>> On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan  wrote:
>>>
>>> Hey Robert,
>>>
>>>> Now the question is, where should the code that does all of this live?
>>>> postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
>>>> know for sure, but what I do know is that we could make a lot of
>>>> progress over where we are today by just improving postgres_fdw, and I
>>>> don't think those improvements are even all that difficult.  If we
>>>> decide we need to implement something new, it's going to be a huge
>>>> project that will take years to complete, with uncertain results.  I'd
>>>> rather have a postgres_fdw-based implementation that is imperfect and
>>>> can't handle some kinds of queries in 9.6 than a promise that by 9.9
>>>> we'll have something really great that handles MPP perfectly.
>>>
>>>
>>> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
>>> for pg_shard, we thought that Map/Reduce would require a comprehensive
>>> revamp of the APIs.
>>>
>>> For Citus, a second part of the question is as FDW writers. We
>>> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't
>>> benefit from even the simple join pushdown that doesn't require Map/Reduce.
>>
>>
>> I didn't get this. Join pushdown infrastructure (chiefly the set of hooks
>> provided in join planning paths) is part of 9.5. Isn't that sufficient to
>> implement join pushdown for the above FDWs? Or are FDW writers facing
>> problems while implementing those hooks? In either case, that should be
>> reported on hackers.
>>
>
>
> I don't think any FDW writer (other than postgres_fdw) has tried to
> implement join pushdown in the respective FDWs using the new API.

Well, 'jdbc2_fdw' seems to implement deparsing at some level:
https://github.com/heimir-sverrisson/jdbc2_fdw/blob/master/deparse.c

...but this was likely a copy/paste job from the postgres_fdw.  It
should not escape note that the deparsing strategy has dependencies on
the optimizer and the parser.  This is not good; asking the FDW
implementations to implement SQL optimizations is not a sustainable
solution.  They should be indicating, for example, "I support ANSI-92
SQL", and the postgres planner should be deparsing the foreign table
definition and rewriting it, not the other way around.
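Merlin's capability-declaration idea could be modeled roughly as follows. This is a hypothetical Python sketch of the planner-side decision only; none of these capability names or functions exist in PostgreSQL, and the real FDW API works quite differently:

```python
# Hypothetical model of capability-driven pushdown: the FDW declares what
# SQL dialect level it supports, and the core planner (not the FDW itself)
# decides whether a join can be shipped to the remote side.
# All names here are illustrative assumptions, not real PostgreSQL APIs.

ANSI_92 = {"inner_join", "left_join", "right_join"}  # e.g. a SQL remote
KEY_VALUE_ONLY = set()                               # e.g. a simple NoSQL wrapper

def plan_join(jointype, fdw_capabilities):
    """Return 'pushdown' if the remote side declared support for this
    join type, otherwise fall back to a local join over plain scans."""
    if jointype in fdw_capabilities:
        return "pushdown"
    return "local_join"

assert plan_join("inner_join", ANSI_92) == "pushdown"
assert plan_join("inner_join", KEY_VALUE_ONLY) == "local_join"
```

The design point is that the deparsing and rewriting knowledge lives once, in core, instead of being copy/pasted into every wrapper.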

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-08 Thread Petr Jelinek

On 2015-09-08 19:52, Robert Haas wrote:

On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan  wrote:

For Citus, a second part of the question is as FDW writers. We implemented

cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
even the simple join pushdown that doesn't require Map/Reduce.

The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers


What do you mean by "support for joins"?  Do you mean that only 18 of
the remote data sources can do joins?  If so, why does that matter?
I'd be quite happy if a join pushdown or "distributed shuffle" API had
as many as 18 users - I'd be quite happy if it had one (postgres_fdw).
The fact that not all FDWs can support every operation because of
limitations on the remote side isn't a reason not to support those
operations when the remote side is capable.



Agreed. While I personally don't think FDWs are long term answer for 
sharding, I do believe that the infrastructure that is being developed 
for FDWs (join/aggregate pushdown) is needed anyway and there are many 
common issues that need solving in this area for FDWs, sharding and 
parallel query.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-08 Thread Robert Haas
On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan  wrote:
> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for
> pg_shard, we thought that Map/Reduce would require a comprehensive revamp of
> the APIs.

Well, so you've said.  But what kind of API do you want to see?
Taking control at some very high-level hook like ExecutorRun() is not
really a maintainable solution - it's fine if you've only got one guy
doing it, perhaps, but if you have several FDWs talking to different
kinds of remote systems, they can't all seize overall control.

>> For Citus, a second part of the question is as FDW writers. We implemented
> cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
> even the simple join pushdown that doesn't require Map/Reduce.
>
> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
> have support for joins:
> https://wiki.postgresql.org/wiki/Foreign_data_wrappers

What do you mean by "support for joins"?  Do you mean that only 18 of
the remote data sources can do joins?  If so, why does that matter?
I'd be quite happy if a join pushdown or "distributed shuffle" API had
as many as 18 users - I'd be quite happy if it had one (postgres_fdw).
The fact that not all FDWs can support every operation because of
limitations on the remote side isn't a reason not to support those
operations when the remote side is capable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-07 Thread Thomas Munro
On Thu, Sep 3, 2015 at 7:03 AM, Josh Berkus  wrote:

> On 09/02/2015 11:41 AM, Robert Haas wrote:
> > 4. Therefore, I think that we should instead use logical replication,
> > which might be either synchronous or asynchronous.  When you modify
> > one copy of the data, that change will then be replicated to all other
> > nodes.  If you are OK with eventual consistency, this replication can
> > be asynchronous, and nodes that are off-line will catch up when they
> > are on-line.  If you are not OK with that, then you must replicate
> > synchronously to every node before transaction commit; or at least you
> > must replicate synchronously to every node that is currently on-line.
> > This presents some challenges: logical decoding currently can't
> > replicate transactions that are still in process - replication starts
> > when the transaction commits.  Also, we don't have any way for
> > synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.
>
> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".
>

I have been thinking about that problem in the context of
synchronous_commit = apply, and while trying to review the multiple
synchronous patch.  How are you supposed to actually make use of
remote-apply semantics without a way to find a replica that is consistent?
And what does consistent mean?  I'm going to say consistency means: it sees
at least all effects of all preceding COMMITs that returned successfully.
It's trivial in a no-timeout, single standby topology (if commit returned,
the sole sync replica has applied your transaction and replied), but beyond
that it obviously requires some more infrastructure and concepts.  Here is
my suggestion:

Imagine if we could configure setups like this:

1.  I have 4 servers called london1, london2, paris1, paris2 (see nearby
multiple sync server thread/patch).
2.  I have synchronous_commit = apply (see nearby thread/patch)
3.  Currently london1 is master, with the other 3 in the synchronous
replication set, and I want a minimum replication set of size 2 so I can
lose 1 of those and continue

So far so good, now for the double-vapourware part:

4.  The replicas know whether they are currently part of the master's
synchronous replication set or not because it tells them
5.  While waiting for replicas, the master only waits up to X milliseconds,
and if no reply is received from a given server it drops that server from
the sync rep set, like a RAID controller dropping an unresponsive element
from a RAID array, but still returns control to the user if 2 nodes (our
configured minimum) did reply
6.  If the sync rep set reaches our minimum size 2 because of a node being
dropped, then you can no longer drop one, so commit hangs (work is blocked
until enough servers connect and catch up again)
7.  If a replica sees that it hasn't received WAL records or pings from the
master with timestamps from the past Y milliseconds, or receives a message
explicitly telling it it's been dropped from the sync rep set, it will
start rejecting my queries on the basis that it's potentially out of date
8.  If the master sees suitable apply reply messages stream in from a node
that was dropped but has now caught up (possibly having reconnected), it
will explicitly tell it that it's back in the sync rep set and start
waiting for it again
9.  X is sufficiently larger than Y so that, combined with the 'you've been
dropped from/rejoined the sync rep set' messages and well sync'd system
clocks, it should not be possible for a replica to expose snapshots that
don't include all effects from transactions whose COMMIT command returned
on the master.  (If you don't have the concept of a dynamic replication set
which replicas are dropped from and rejoin, then the master can't drop a
node and carry on, like a RAID controller would, unless it's happy to wait
for any old N nodes to reply.  Waiting for any N nodes to reply may be OK
for log-flush-only sync rep, but if you want to use apply sync rep and have
guarantees about visibility, and you simply wait for any N nodes to reply,
then the nodes themselves don't know if they are up to date with master or
not (they don't know if they were one of the ones that master waited for
some transaction they haven't even heard about yet), so they may show users
old data.   Also, if you don't have the master telling the replicas that it
considers them to be in or out of the replication set, they don't kn

Re: [HACKERS] Horizontal scalability/sharding

2015-09-07 Thread Ahsan Hadi
I

On Monday, September 7, 2015, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

>
>
> On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan wrote:
>
>> Hey Robert,
>>
>> Now the question is, where should the code that does all of this live?
>>>  postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
>>> know for sure, but what I do know is that we could make a lot of
>>> progress over where we are today by just improving postgres_fdw, and I
>>> don't think those improvements are even all that difficult.  If we
>>> decide we need to implement something new, it's going to be a huge
>>> project that will take years to complete, with uncertain results.  I'd
>>> rather have a postgres_fdw-based implementation that is imperfect and
>>> can't handle some kinds of queries in 9.6 than a promise that by 9.9
>>> we'll have something really great that handles MPP perfectly.
>>>
>>
>> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
>> for pg_shard, we thought that Map/Reduce would require a comprehensive
>> revamp of the APIs.
>>
>> For Citus, a second part of the question is as FDW writers. We
>> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't
>> benefit from even the simple join pushdown that doesn't require Map/Reduce.
>>
>
> I didn't get this. Join pushdown infrastructure (chiefly the set of hooks
> provided in join planning paths) is part of 9.5. Isn't that sufficient to
> implement join pushdown for the above FDWs? Or are FDW writers facing
> problems while implementing those hooks? In either case, that should be
> reported on hackers.
>
>

I don't think any FDW writer (other than postgres_fdw) has tried to
implement join pushdown in the respective FDWs using the new API.


>
>> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
>> have support for joins:
>> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>>
>> Best,
>> Ozgun
>>
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>


-- 
Ahsan Hadi
Snr Director Product Development
EnterpriseDB Corporation
The Enterprise Postgres Company

Phone: +92-51-8358874
Mobile: +92-333-5162114

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] Horizontal scalability/sharding

2015-09-06 Thread Ashutosh Bapat
On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan  wrote:

> Hey Robert,
>
> Now the question is, where should the code that does all of this live?
>>  postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
>> know for sure, but what I do know is that we could make a lot of
>> progress over where we are today by just improving postgres_fdw, and I
>> don't think those improvements are even all that difficult.  If we
>> decide we need to implement something new, it's going to be a huge
>> project that will take years to complete, with uncertain results.  I'd
>> rather have a postgres_fdw-based implementation that is imperfect and
>> can't handle some kinds of queries in 9.6 than a promise that by 9.9
>> we'll have something really great that handles MPP perfectly.
>>
>
> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
> for pg_shard, we thought that Map/Reduce would require a comprehensive
> revamp of the APIs.
>
> For Citus, a second part of the question is as FDW writers. We implemented
> cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
> even the simple join pushdown that doesn't require Map/Reduce.
>

I didn't get this. Join pushdown infrastructure (chiefly the set of hooks
provided in join planning paths) is part of 9.5. Isn't that sufficient to
implement join pushdown for the above FDWs? Or are FDW writers facing
problems while implementing those hooks? In either case, that should be
reported on hackers.


>
> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
> have support for joins:
> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>
> Best,
> Ozgun
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Horizontal scalability/sharding

2015-09-04 Thread Ozgun Erdogan
Hey Robert,

Now the question is, where should the code that does all of this live?
>  postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
> know for sure, but what I do know is that we could make a lot of
> progress over where we are today by just improving postgres_fdw, and I
> don't think those improvements are even all that difficult.  If we
> decide we need to implement something new, it's going to be a huge
> project that will take years to complete, with uncertain results.  I'd
> rather have a postgres_fdw-based implementation that is imperfect and
> can't handle some kinds of queries in 9.6 than a promise that by 9.9
> we'll have something really great that handles MPP perfectly.
>

Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
for pg_shard, we thought that Map/Reduce would require a comprehensive
revamp of the APIs.

For Citus, a second part of the question is as FDW writers. We implemented
cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
even the simple join pushdown that doesn't require Map/Reduce.

The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Best,
Ozgun


Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Josh Berkus
On 09/03/2015 03:57 AM, Bruce Momjian wrote:
>> > 
>> > Yes, the logical replication has similar syncrep properties as the
>> > binary one (feedback works same way).

Oh?  What does UDR/BDR currently support for sync?

> Yes, I assumed that.  Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.  However,
> of course, it is synchronous in being durable, not synchronous in terms
> of applying the WAL.  This is true of binary and logical replication.

Well, there's no such thing as simultaneity in scalable architectures.
But users are already used to that ... anybody who load-balances to read
slaves knows about lag.  The only way* to ensure near-simultaneity is to
have some kind of single-node, single-process GTM for the cluster, and
then your actual scalability goes bye-bye.

The bigger issue we'll need to address with this is the fight between
lag and load-balancing, which would become a much worse issue with
read-load-balanced shards which are transparent to the user.  They'd see
the effects of lag, without having actually chosen to use this or that
replica.  This is the other reason to look at logical replication;
presumably with logrep, we can be more discriminating about what
activities cause lag (for one thing, vacuum won't).

Also:
On 09/03/2015 07:00 AM, Kevin Grittner wrote:
> There is another approach to this that we should consider how (if?)
> we are going to cover: database affinity.  I have seen cases where
> there are multiple databases which are targets of asynchronous
> replication, with a web application load balancing among them.  The
> application kept track of which copy each connection was using, so
> that when they were not exactly in sync the user never saw "time
> moving backward".  Two different users might see versions of the
> data from different points in time, but that generally doesn't
> matter, especially if the difference is just a few minutes.  If one
> copy got too far behind for some reason, they would load-shift to
> the other servers (time still moves forward, only there is a "jump"
> forward at the shift).  This would allow the tardy database to be
> dedicated to catching up again.
>
> Bottom line is that this very smooth behavior required two features
> -- the ability for the application to control database affinity,
> and the ability to shift that affinity gracefully (with no down
> time).

Yes.  Frankly, it would be *easier* to code things so that the same
session always gets its requests load balanced to the same copies;
making that a feature, too, is nice.


(* there are actually other ways to come close to simultaneity, but they
are much more complicated)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Andres Freund
On 2015-09-03 12:10:08 -0400, Robert Haas wrote:
> On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian  wrote:
> > Yes, I assumed that.  Logical replication uses WAL, so if you are
> > synchronous with WAL, logical replication is synchronous too.  However,
> > of course, it is synchronous in being durable, not synchronous in terms
> > of applying the WAL.  This is true of binary and logical replication.

Actually that's not really true - it's just a question which LSNs you
return. For UDR/BDR the relevant LSN is the LSN of the last durably
committed transaction. And thus they wait for apply, not anything else.

> But, Thomas Munro is fixing it!

+ many to that effort.

Greetings,

Andres Freund




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Robert Haas
On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian  wrote:
> Yes, I assumed that.  Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.  However,
> of course, it is synchronous in being durable, not synchronous in terms
> of applying the WAL.  This is true of binary and logical replication.

But, Thomas Munro is fixing it!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Bruce Momjian
On Thu, Sep  3, 2015 at 03:40:40PM +0200, Tomas Vondra wrote:
> Not really, the idea is that you don't need to create the replica
> immediately. The system recognizes that primary shard location is
> unavailable and redirects the tasks to the "replicas." So the time
> to recreate the failed node is not that critical.
> 
> It needs to be done in a smart way to prevent some typical issues
> like suddenly doubling the load on replicas due to failure of the
> primary location. By using a different group of nodes for each "data
> segment" you can eliminate this, because the group of nodes to
> handle the additional load will be larger.
> 
> The other issue then of course is that the groups of nodes must not
> be entirely random, otherwise the cluster would suffer data loss in
> case of outage of arbitrary group of K nodes (where K is the number
> of replicas for each piece of data).
> 
> It's also non-trivial to do this when you have to consider racks,
> data centers etc.
> 
> With regular slaves you can't do any of this - no matter what you
> do, you have to load balance the additional load only on the slaves.

Yes, and imagine doing this with FDW's, updating the catalog table
location of the FDW as part of the failover process --- interesting.
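Bruce's failover scenario can be sketched at a high level. The catalog is modeled here as a plain dict; in actual PostgreSQL, repointing a foreign server would be done with ALTER SERVER ... OPTIONS (SET host ...) against the FDW server entry (the function and hostnames below are illustrative assumptions):

```python
# Sketch of failover-driven catalog updates: the coordinator rewrites its
# shard-to-server mapping so the foreign "shard" tables point at the
# promoted replica. In PostgreSQL this mapping would live in the FDW
# server definitions; the dict and names here are assumptions.

shard_servers = {"shard_1": "node-a.example", "shard_2": "node-b.example"}

def fail_over(catalog, shard, promoted_host):
    """Repoint a shard's server entry at the newly promoted replica and
    return the old (failed) host."""
    old = catalog[shard]
    catalog[shard] = promoted_host
    return old

assert fail_over(shard_servers, "shard_1", "node-a-replica.example") == "node-a.example"
assert shard_servers["shard_1"] == "node-a-replica.example"
```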

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Kevin Grittner
Josh Berkus  wrote:

>>> You'd also need a way to let the connection nodes know when a replica
>>> has fallen behind so that they can be taken out of
>>> load-balancing/sharding for read queries.  For the synchronous model,
>>> that would be "fallen behind at all"; for asynchronous it would be
>>> "fallen more than ### behind".
>>
>> How is that different from the previous thing?  Just that we'd treat
>> "lagging" as "down" beyond some threshold?  That doesn't seem like a
>> mandatory feature.
>
> It's a mandatory feature if you want to load-balance reads.  We have to
> know which nodes not to send reads to because they are out of sync.

There is another approach to this that we should consider how (if?)
we are going to cover: database affinity.  I have seen cases where
there are multiple databases which are targets of asynchronous
replication, with a web application load balancing among them.  The
application kept track of which copy each connection was using, so
that when they were not exactly in sync the user never saw "time
moving backward".  Two different users might see versions of the
data from different points in time, but that generally doesn't
matter, especially if the difference is just a few minutes.  If one
copy got too far behind for some reason, they would load-shift to
the other servers (time still moves forward, only there is a "jump"
forward at the shift).  This would allow the tardy database to be
dedicated to catching up again.

Bottom line is that this very smooth behavior required two features
-- the ability for the application to control database affinity,
and the ability to shift that affinity gracefully (with no down
time).
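The affinity behavior Kevin describes can be sketched in a small router. This is an illustrative Python model, with the lag threshold and all names being assumptions rather than anything in PostgreSQL or a real load balancer:

```python
# Sketch of database affinity: each session sticks to one replica so its
# view of time never moves backward, and a replica that falls too far
# behind is load-shifted away from (time "jumps" forward at the shift,
# but never back). The threshold and names are illustrative assumptions.

MAX_LAG_MS = 60_000  # "too far behind" threshold (assumed value)

def route(session_id, affinity, replicas, lag_ms):
    """Pick a replica for this session, keeping affinity while the
    assigned replica's lag stays under the threshold."""
    current = affinity.get(session_id)
    if current is not None and lag_ms[current] <= MAX_LAG_MS:
        return current                       # sticky: no time travel
    # Load-shift: pick the least-lagged replica and remember it.
    best = min(replicas, key=lambda r: lag_ms[r])
    affinity[session_id] = best
    return best

affinity = {}
lag = {"r1": 500, "r2": 200_000, "r3": 1_000}
assert route("s1", affinity, ["r1", "r2", "r3"], lag) == "r1"
lag["r1"] = 300_000                           # r1 falls too far behind
assert route("s1", affinity, ["r1", "r2", "r3"], lag) == "r3"  # shift forward
assert route("s1", affinity, ["r1", "r2", "r3"], lag) == "r3"  # sticky again
```

The second feature Kevin names, graceful shifting, corresponds to the load-shift branch: the session's clock jumps forward to a fresher replica, but never backward.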

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Tomas Vondra

Hi,

On 09/03/2015 05:02 AM, Amit Kapila wrote:

On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian wrote:
 >
 > On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
 > > >Can you explain why logical replication is better than binary
 > > >replication for this use-case?
 > > >
 > >
 > > Selectivity?
 >

I was assuming you would just create identical slaves to handle
failure, rather than moving selected data around.

 >

Yes, I also think so, otherwise when the shard goes down and its
replica has to take the place of the shard, it will take more time to
make the replica available, as it won't have all the data the original
shard had.


Not really, the idea is that you don't need to create the replica 
immediately. The system recognizes that primary shard location is 
unavailable and redirects the tasks to the "replicas." So the time to 
recreate the failed node is not that critical.


It needs to be done in a smart way to prevent some typical issues like 
suddenly doubling the load on replicas due to failure of the primary 
location. By using a different group of nodes for each "data segment" you 
can eliminate this, because the group of nodes to handle the additional 
load will be larger.


The other issue then of course is that the groups of nodes must not be 
entirely random, otherwise the cluster would suffer data loss in case of 
outage of arbitrary group of K nodes (where K is the number of replicas 
for each piece of data).


It's also non-trivial to do this when you have to consider racks, data 
centers etc.


With regular slaves you can't do any of this - no matter what you do, 
you have to load balance the additional load only on the slaves.
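Tomas's point about spreading failover load can be made concrete with a toy placement comparison (an illustrative Python sketch; node and shard names are assumptions):

```python
# Toy illustration: if each node's shards are replicated across different
# peers (rather than one mirror node), the extra load after a failure is
# spread over many surviving nodes instead of doubling on one.

def failover_load(placement, failed):
    """placement maps shard -> (primary, replica). Count how many extra
    shards each surviving node picks up when `failed` goes down."""
    extra = {}
    for primary, replica in placement.values():
        if primary == failed:
            extra[replica] = extra.get(replica, 0) + 1
    return extra

# Mirrored pairs: all of n1's shards fail over to n2 (its load doubles).
mirrored = {f"s{i}": ("n1", "n2") for i in range(4)}
assert failover_load(mirrored, "n1") == {"n2": 4}

# Scattered replicas: the same 4 shards fail over to 4 different nodes.
scattered = {"s0": ("n1", "n2"), "s1": ("n1", "n3"),
             "s2": ("n1", "n4"), "s3": ("n1", "n5")}
assert failover_load(scattered, "n1") == {"n2": 1, "n3": 1, "n4": 1, "n5": 1}
```

The constraint Tomas raises still applies: the groups cannot be fully random, or an unlucky set of K simultaneous failures could take out every copy of some shard.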


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Bruce Momjian
On Thu, Sep  3, 2015 at 10:33:12AM +0200, Petr Jelinek wrote:
> On 2015-09-02 19:57, Josh Berkus wrote:
> > On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> >> On 2015-09-02 00:09, Josh Berkus wrote:
> >>> Not really, the mechanism is different and the behavior is different.
> >>> One critical deficiency in using binary syncrep is that you can't do
> >>> round-robin redundancy at all; every redundant node has to be an exact
> >>> mirror of another node.  In a good HA distributed system, you want
> >>> multiple shards per node, and you want each shard to be replicated to a
> >>> different node, so that in the event of node failure you're not dumping
> >>> the full load on one other server.
> >>>
> >>
> >> This assumes that we use binary replication, but we can reasonably use
> >> logical replication which can quite easily do filtering of what's
> >> replicated where.
> >
> > Is there a way to do logical synchronous replication?  I didn't think
> > there was.
> >
> 
> Yes, the logical replication has similar syncrep properties as the
> binary one (feedback works same way).

Yes, I assumed that.  Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too.  However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL.  This is true of binary and logical replication.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-03 Thread Petr Jelinek
On 2015-09-02 19:57, Josh Berkus wrote:
> On 09/01/2015 04:14 PM, Petr Jelinek wrote:
>> On 2015-09-02 00:09, Josh Berkus wrote:
>>> Not really, the mechanism is different and the behavior is different.
>>> One critical deficiency in using binary syncrep is that you can't do
>>> round-robin redundancy at all; every redundant node has to be an exact
>>> mirror of another node.  In a good HA distributed system, you want
>>> multiple shards per node, and you want each shard to be replicated to a
>>> different node, so that in the event of node failure you're not dumping
>>> the full load on one other server.
>>>
>>
>> This assumes that we use binary replication, but we can reasonably use
>> logical replication which can quite easily do filtering of what's
>> replicated where.
>
> Is there a way to do logical synchronous replication?  I didn't think
> there was.
>

Yes, the logical replication has similar syncrep properties as the
binary one (feedback works same way).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tatsuo Ishii
>> One lesson from XL we got is that we need testing framework for cluster,
>> so any cluster project should at least pass functional and performance
>> testing.
>>
> 
> +1. In early XC days, we focused a lot on adding newer features and
> supporting as many PG features as possible. That took its toll on the
> testing and QA. It was a mistake, though my feeling is we tried to correct
> that to some extent with XL. We did a 9.5 merge, which of course was a big
> deal, but other than that, more time is being spent on improving stability
> and performance.

Agreed. Any cluster project needs a cluster testing framework.

The pgpool-II project runs a "build farm" which runs cluster regression
tests every day. The tests include several combinations of pgpool-II and
PostgreSQL versions using Docker. It still needs more tests, but even with
limited test cases it is pretty useful for detecting bugs.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Pavan Deolasee
On Wed, Sep 2, 2015 at 9:04 PM, Oleg Bartunov  wrote:

>
>
>
> One lesson from XL we got is that we need testing framework for cluster,
> so any cluster project should at least pass functional and performance
> testing.
>

+1. In early XC days, we focused a lot on adding newer features and
supporting as many PG features as possible. That took its toll on the
testing and QA. It was a mistake, though my feeling is we tried to correct
that to some extent with XL. We did a 9.5 merge, which of course was a big
deal, but other than that, more time is being spent on improving stability
and performance.

> XL was very easy to break and I'm wondering how many corner cases still
> exist.
>

Your team reported 2 or 3 major issues which I think we were able to fix
quite quickly. But if there are more such issues which your team has
recorded somewhere, I would request you to send them to the XL mailing
list. I would definitely want to look at them and address them.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Kapila
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian  wrote:
>
> On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> > >Can you explain why logical replication is better than binary
> > >replication for this use-case?
> > >
> >
> > Selectivity?
>
> I was assuming you would just create identical slaves to handle failure,
> rather than moving selected data around.
>

Yes, I also think so; otherwise, when a shard goes down and its replica
has to take its place, it will take more time to make the replica
available, as it won't have all the data the original shard had.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 09:03:25PM -0400, Robert Haas wrote:
> > Can you explain why logical replication is better than binary
> > replication for this use-case?
> 
> Uh, well, for the same reasons it is better in many other cases.
> Particularly, you probably don't want to replicate all the data on
> machine A to machine B, just some of it.
> 
> Typically, sharding solutions store multiple copies of each piece of
> data.  So let's say you have 4 machines.  You divide the data into 12
> chunks.  Each machine is the write-master for 2 of those chunks, but
> has secondary copies of 3 others.  So maybe things start out like
> this:
> 
> machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
> machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
> machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
> machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9
> 
> If machine #1 is run over by a rabid triceratops, you can make machine
> #2 the master for chunk 1, machine #3 the master for chunk 2, and
> machine #4 the master for chunk 3.  The write load therefore remains
> evenly divided.  If you can only copy entire machines, you can't
> achieve that in this situation.

I see the advantage of this now.  My original idea is that each shard
would have its own standby for disaster recovery, but your approach
above, which I know is typical, allows the shards to back up each other.
You could say shard 2 is the backup for shard 1, but then if shard 1
goes bad, the entire workload of shard 1 goes to shard 2.  With the
above approach, the load of shard 1 is shared by all the shards.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> >Can you explain why logical replication is better than binary
> >replication for this use-case?
> >
> 
> Selectivity?

I was assuming you would just create identical slaves to handle failure,
rather than moving selected data around.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Joshua D. Drake

On 09/02/2015 03:56 PM, Bruce Momjian wrote:

On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.


Can you explain why logical replication is better than binary
replication for this use-case?



Selectivity?

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 3:41 AM, Robert Haas wrote:
> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

I don't recall that. Immutable functions are switched to constants in
the query sent to datanodes. Volatile and stable functions are
evaluated locally after fetching the results from the remote node. Not
that efficient for warehouse loads. My 2c.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.

That's something that the quorum synchronous patch would address.
Still, having the possibility to be synchronous across multiple nodes
does not seem to be something at the top of the list.

> Also, the GTM needs to be aware that this stuff is happening, or it will 
> DTWT.  That too seems like a problem that can be solved.

If I understood correctly, yes it is: with its centralized transaction
facility, each node is aware of the transaction status via the global
snapshot.
-- 
Michael




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 6:56 PM, Bruce Momjian  wrote:
> On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.  But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Can you explain why logical replication is better than binary
> replication for this use-case?

Uh, well, for the same reasons it is better in many other cases.
Particularly, you probably don't want to replicate all the data on
machine A to machine B, just some of it.

Typically, sharding solutions store multiple copies of each piece of
data.  So let's say you have 4 machines.  You divide the data into 12
chunks.  Each machine is the write-master for 2 of those chunks, but
has secondary copies of 3 others.  So maybe things start out like
this:

machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9

If machine #1 is run over by a rabid triceratops, you can make machine
#2 the master for chunk 1, machine #3 the master for chunk 2, and
machine #4 the master for chunk 3.  The write load therefore remains
evenly divided.  If you can only copy entire machines, you can't
achieve that in this situation.
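The reassignment scheme above can be sketched as a small placement
computation. This is a purely illustrative model (the data structures and
function are made up for the example, not any real implementation):

```python
# Sketch of the chunk-failover idea: when a machine dies, each chunk it
# mastered is promoted on some surviving machine that already holds a
# secondary copy of it, picking the survivor with the fewest masters so
# that the write load stays evenly divided.

masters = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9], 4: [10, 11, 12]}
copies = {1: [4, 7, 10], 2: [1, 8, 11], 3: [2, 5, 12], 4: [3, 6, 9]}

def fail_over(dead, masters, copies):
    orphaned = masters.pop(dead)   # chunks that just lost their master
    copies.pop(dead)
    for chunk in orphaned:
        # survivors that hold a secondary copy of this chunk
        candidates = [m for m, cs in copies.items() if chunk in cs]
        # promote the copy on the least-loaded candidate
        target = min(candidates, key=lambda m: len(masters[m]))
        masters[target].append(chunk)
        copies[target].remove(chunk)
    return masters

masters = fail_over(1, masters, copies)
# With the layout above: machine #2 picks up chunk 1, #3 picks up
# chunk 2, #4 picks up chunk 3 - every survivor masters 4 chunks.
```

Physical replication cannot express this, because each secondary copy on a
machine comes from a different master.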

I'm not saying that the above is exactly what we're going to end up
with, or even necessarily close.  But a big part of the point of
sharding is that not all the machines have the same data - otherwise
you are not write scaling.  But it will frequently be the case, for
various reasons, that they have *overlapping* sets of data.  Logical
replication can handle that; physical replication can't.

In Postgres-XC, all tables are either sharded (part of the table is
present on each node) or distributed (all of the table is present on
every node).  Clearly, there's no way to use physical replication in
that scenario except if you are OK with having two copies of every
node.  But that's not a very good solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 12:03:36PM -0700, Josh Berkus wrote:
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.
> 
> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

I think this gets back to the idea of running an administrative alert
command when we switch to a different server from
synchronous_standby_names.  We can't just keep requiring external
tooling to identify things that the database easily knows and could
alert on itself.  Removing failed nodes is also something we should do
and notify users about.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Can you explain why logical replication is better than binary
replication for this use-case?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 12:30 PM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>>> 4. Therefore, I think that we should instead use logical replication,
>>> which might be either synchronous or asynchronous.  When you modify
>>> one copy of the data, that change will then be replicated to all other
>>> nodes.  If you are OK with eventual consistency, this replication can
>>> be asynchronous, and nodes that are off-line will catch up when they
>>> are on-line.  If you are not OK with that, then you must replicate
>>> synchronously to every node before transaction commit; or at least you
>>> must replicate synchronously to every node that is currently on-line.
>>> This presents some challenges: logical decoding currently can't
>>> replicate transactions that are still in process - replication starts
>>> when the transaction commits.  Also, we don't have any way for
>>> synchronous replication to wait for multiple nodes.
>>
>> Well, there is a WIP patch for that, which IMHO would be much improved
>> by having a concrete use-case like this one.  What nobody is working on
>> -- and we've vetoed in the past -- is a way of automatically failing and
>> removing from replication any node which repeatedly fails to sync, which
>> would be a requirement for this model.
> 
> Yep.  It's clear to me we need that in general, not just for sharding.
> To me, the key is to make sure there's a way for the cluster-ware to
> know about the state transitions.  Currently, when the synchronous
> standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

There are many parts of our replication which are still effectively
unmonitorable. For example, there's still no way to tell from the
replica that it's lost contact with the master except by tailing the
log.  If we try to build bigger systems on top of these components,
we'll find that we need to add a lot of instrumentation.

> 
>> You'd also need a way to let the connection nodes know when a replica
>> has fallen behind so that they can be taken out of
>> load-balancing/sharding for read queries.  For the synchronous model,
>> that would be "fallen behind at all"; for asynchronous it would be
>> "fallen more than ### behind".
> 
> How is that different from the previous thing?  Just that we'd treat
> "lagging" as "down" beyond some threshold?  That doesn't seem like a
> mandatory feature.

It's a mandatory feature if you want to load-balance reads.  We have to
know which nodes not to send reads to because they are out of sync.

>> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
> 
> I don't think so.  If transaction A writes data on X which is
> replicated to Y and then commits, a new snapshot which shows A as
> committed can't be used on Y until A's changes have been replicated
> there.  That could be enforced by having the commit of A wait for
> replication, or by having an attempt by a later transaction to use the
> snapshot on Y wait until replication completes, or some even more
> sophisticated strategy that considers whether the replication backlog
> touches the same data that the new transaction will read.  It's
> complicated, but it doesn't seem intractable.

I need to see this on a chalkboard to understand it.

>>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
>> harder to set up, and becomes a SPOF.
> 
> I partially agree.  I think it's very important that the GTM is an
> optional feature of whatever we end up with, rather than an
> indispensable component.  People who don't want it shouldn't have to
> pay the price in performance and administrative complexity.  But at
> the same time, I think a lot of people will want it, because without
> it, the fact that sharding is in use is much less transparent to the
> application.

If it can be optional, then we're pretty close to covering most use
cases with one general infrastructure.  That would be nice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tomas Vondra



On 09/02/2015 08:27 PM, Robert Haas wrote:

On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure 
wrote:


This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions which might
make sense for single table queries but breaks down in the face of
joins.


+1 to these concerns


Well, I don't think that ALL of the logic should go into the FDW.


Then maybe we shouldn't call this "FDW-based sharding" (or "FDW 
approach" or whatever was used in this thread so far) because that kinda 
implies that the proposal is to build on FDW.


In my mind, FDW is a wonderful tool to integrate PostgreSQL with 
external data sources, and it's nicely shaped for this purpose, which 
implies the abstractions and assumptions in the code.


The truth however is that many current uses of the FDW API are actually 
using it for different purposes because there's no other way to do that, 
not because FDWs are the "right way". And this includes the attempts to 
build sharding on FDW, I think.


Situations like this result in "improvements" of the API that seem to 
improve the API for the second group, but make the life harder for the 
original FDW API audience by making the API needlessly complex. And I 
say "seem to improve" because the second group eventually runs into the 
fundamental abstractions and assumptions the API is based on anyway.


And based on the discussions at pgcon, I think this is the main reason 
why people cringe when they hear "FDW" and "sharding" in the same sentence.


I'm not opposed to reusing the FDW infrastructure, of course.

> In particular, in this example, parallel aggregate needs the same
> query rewrite, so the logic for that should live in core so that
> both parallel and distributed queries get the benefit.

I'm not sure the parallel query is a great example here - maybe I'm 
wrong but I think it's a fairly isolated piece of code, and we have a
pretty clear idea of the two use cases.


I'm sure it's non-trivial to design it well for both cases, but I think 
the questions for FDW/sharding will be much more about abstract concepts
than particular technical solutions.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.

Yep.  It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions.  Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

How is that different from the previous thing?  Just that we'd treat
"lagging" as "down" beyond some threshold?  That doesn't seem like a
mandatory feature.

>> But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.

I don't think so.  If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there.  That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read.  It's
complicated, but it doesn't seem intractable.
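The second option - making a later transaction's snapshot wait on Y - can
be sketched in terms of LSNs. This is a toy model under assumed semantics,
not actual PostgreSQL code: a snapshot that shows A as committed carries
A's commit position, and a node must not serve reads under that snapshot
until its replay position has caught up to it.

```python
# Toy model: a snapshot on node Y blocks until Y has replayed up to the
# commit LSN that the snapshot depends on, instead of making the commit
# of A itself wait for replication.

import threading

class Node:
    def __init__(self):
        self.replay_lsn = 0
        self.cond = threading.Condition()

    def apply_up_to(self, lsn):
        # The replication apply loop advances the replay position.
        with self.cond:
            self.replay_lsn = max(self.replay_lsn, lsn)
            self.cond.notify_all()

    def use_snapshot(self, snapshot_lsn, timeout=5.0):
        # Block until this node has replayed everything the snapshot
        # considers committed; only then is the snapshot safe here.
        with self.cond:
            ok = self.cond.wait_for(
                lambda: self.replay_lsn >= snapshot_lsn, timeout)
        if not ok:
            raise TimeoutError("replication backlog not caught up")
        return True

y = Node()
# A's changes arrive asynchronously; the reader waits, then proceeds.
threading.Timer(0.1, y.apply_up_to, args=(42,)).start()
y.use_snapshot(42)
```

The "more sophisticated strategy" would replace the simple LSN comparison
with a check of whether the outstanding backlog touches the data the new
transaction will actually read.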

>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
> harder to set up, and becomes a SPOF.

I partially agree.  I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component.  People who don't want it shouldn't have to
pay the price in performance and administrative complexity.  But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 11:41 AM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
>> Even if it's only on paper, any new sharding design needs to address
>> these questions:
>>
>> 1. How do we ensure no/minimal data is lost if we lose a node?
>> 2. How do we replace a lost node (without taking the cluster down)?
>>2. a. how do we allow an out-of-sync node to "catch up"?
>> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
>> 4. How do we add nodes to expand the cluster?
>>
>> There doesn't need to be code for all of the above from version 0.1, but
>> there needs to be a plan to tackle those problems.  Otherwise, we'll
>> just end up with another dead-end, not-useful-in-production technology.
> 
> This is a good point, and I think I agree with it.  Let me make a few
> observations:
> 
> 1. None of this stuff matters very much when the data is strictly
> read-only. 

Yep.

> 2. None of this stuff matters when you only have one copy of the data.
> Your system is low-availability, but you just don't care for whatever
> reason. 

Uh-huh.

> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.

For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader).  So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
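The routing rule a connection node would apply can be sketched like this
(a hypothetical helper; the names, lag units, and threshold are made up
for illustration):

```python
# Sketch of the read-routing rule above: in the synchronous model any
# lag at all disqualifies a replica from read load-balancing; in the
# asynchronous model a replica is skipped only once its lag exceeds a
# configured threshold.

def readable_replicas(replicas, synchronous, max_lag_bytes=16 * 1024 * 1024):
    """replicas: list of (name, lag_bytes) pairs reported by monitoring."""
    limit = 0 if synchronous else max_lag_bytes
    return [name for name, lag in replicas if lag <= limit]

replicas = [("r1", 0), ("r2", 4096), ("r3", 64 * 1024 * 1024)]
readable_replicas(replicas, synchronous=True)    # only the fully caught-up r1
readable_replicas(replicas, synchronous=False)   # r1 and the slightly-behind r2
```

The hard part, of course, is the monitoring feed itself - which is exactly
the instrumentation gap discussed above.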

> But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
 I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
>2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it.  Let me make a few
observations:

1. None of this stuff matters very much when the data is strictly
read-only.  You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't.  You
replace a lost node by taking a new copy.  Nodes never need to catch up
because there are no changes happening.  To bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime).  So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.  The issue arises when you have multiple copies of the data,
and the data is being changed.  Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.

3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants.  Then it
generates an SQL statement to be sent to the data node to make the
appropriate change.  If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node.  I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good.  Fundamentally, this model doesn't
allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.
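To make the "replicate synchronously to every node that is currently on-line" rule concrete, here is a toy sketch (illustrative Python; the names `logs`, `apply_on`, and `commit` are inventions for this sketch, not PostgreSQL internals):

```python
from concurrent.futures import ThreadPoolExecutor

logs = {}  # replica name -> applied changes; stands in for real nodes

def apply_on(replica, change):
    # A real system would ship WAL or a logical change stream here.
    logs.setdefault(replica, []).append(change)
    return True  # the replica's acknowledgement

def commit(change, replicas, online):
    # Wait for every currently-online replica before reporting success.
    # The waits run in parallel, so commit latency is the slowest ack,
    # not the sum of them; offline nodes must catch up later.
    targets = [r for r in replicas if r in online]
    with ThreadPoolExecutor(max_workers=max(len(targets), 1)) as pool:
        acks = list(pool.map(lambda r: apply_on(r, change), targets))
    return all(acks)

ok = commit("UPDATE ...", ["n1", "n2", "n3"], online={"n1", "n3"})
```

Here n2 misses the change and would have to catch up from the replication stream when it comes back, which is exactly the gap the GTM would need to be aware of.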

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure  wrote:
> On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
>> It would be a bad idea to cling blindly to the FDW infrastructure if
>> it's fundamentally inadequate to do what we want.  On the other hand,
>> it would also be a bad idea to set about recreating it without a
>> really good reason, and - just to take one example - the fact that it
>> doesn't currently push down DML operations to the remote side is not a
>> really good reason to rewrite the whole thing.  On the contrary, it's
>> a reason to put some energy into the already-written patch which
>> implements that optimization.
>
> The problem with FDW for these purposes as I see it is that too much
> intelligence is relegated to the implementer of the API.  There needs
> to be a mechanism so that the planner can rewrite the remote query and
> then do some after-the-fact processing.  This is exactly what Citus does;
> if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
> so that the aggregation can be properly weighted in the result.   To do
> this (distributed OLAP-type processing) right, the planner needs to
> *know* that this table is in fact distributed and also know that it
> can make SQL compatible adjustments to the query.
>
> This strikes me as a bit of a conflict of interest with FDW which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes its own optimization decisions which might make
> sense for single table queries but breaks down in the face of joins.

Well, I don't think that ALL of the logic should go into the FDW.  In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both
parallel and distributed queries get the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Merlin Moncure
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
> It would be a bad idea to cling blindly to the FDW infrastructure if
> it's fundamentally inadequate to do what we want.  On the other hand,
> it would also be a bad idea to set about recreating it without a
> really good reason, and - just to take one example - the fact that it
> doesn't currently push down DML operations to the remote side is not a
> really good reason to rewrite the whole thing.  On the contrary, it's
> a reason to put some energy into the already-written patch which
> implements that optimization.

The problem with FDW for these purposes as I see it is that too much
intelligence is relegated to the implementer of the API.  There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after-the-fact processing.  This is exactly what Citus does;
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that the aggregation can be properly weighted in the result.   To do
this (distributed OLAP-type processing) right, the planner needs to
*know* that this table is in fact distributed and also know that it
can make SQL compatible adjustments to the query.
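A toy model of the rewrite described above (illustrative Python, not Citus code): the coordinator collects SUM and COUNT partials from each shard and divides once at the end, because averaging the per-shard averages is wrong whenever shard sizes differ.

```python
def shard_partials(rows):
    # Each shard runs the rewritten query: SELECT sum(foo), count(foo)
    return sum(rows), len(rows)

def distributed_avg(shards):
    # The coordinator combines partials; the division happens exactly once.
    partials = [shard_partials(rows) for rows in shards]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

shards = [[1, 2, 3], [10]]  # shard sizes differ
avg_of_avgs = sum(sum(r) / len(r) for r in shards) / len(shards)  # 6.0, wrong
true_avg = distributed_avg(shards)                                # 16/4 = 4.0
```

The point is that only a planner that *knows* the table is distributed can perform this SUM/COUNT decomposition; a wrapper that hides the distribution cannot.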

This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions which might make
sense for single table queries but breaks down in the face of joins.

merlin




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
> 
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication.  We don't have that at
present (sync rep is whole-node, and BDR is asynchronous).  There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node.  In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>>
> 
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication?  I didn't think
there was.
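The round-robin redundancy described in the quoted text above, where each shard's copies land on different nodes so that a node failure spreads load across all survivors, can be sketched as follows (purely illustrative; no sharding system is claimed to use this exact scheme):

```python
def place_replicas(nshards, nodes, copies=2):
    # Shard i's primary goes to node i mod N, and each extra copy to the
    # following nodes around the ring.  When a node dies, its shards' load
    # is spread across all survivors rather than dumped on one mirror.
    n = len(nodes)
    return {s: [nodes[(s + c) % n] for c in range(copies)]
            for s in range(nshards)}

placement = place_replicas(6, ["A", "B", "C"])
# placement[0] == ["A", "B"], placement[1] == ["B", "C"],
# placement[2] == ["C", "A"], and so on around the ring.
```

Contrast this with binary syncrep, where every standby must be an exact whole-node mirror and no such per-shard placement is possible.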

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly.  But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>>
> 
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly!

Let me explain why I'm so vocal on this point.  PostgresXC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened).  I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
   2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems.  Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Oleg Bartunov
On Tue, Sep 1, 2015 at 7:08 PM, Robert Haas  wrote:

> On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
>  wrote:
> > My worry is that if we start implementing them again from scratch, it
> will
> > take a few years before we get them in a usable state. What XC/XL lacked
> is
> > probably a Robert Haas or a Tom Lane who could look at the work and
> suggest
> > major edits. If that had happened, the quality of the product could have
> > been much better today. I don't mean to denigrate the developers who worked
> on
> > XC/XL, but there is no harm in accepting that if someone with a much
> better
> > understanding of the whole system was part of the team, that would have
> > positively impacted the project. Is that an angle worth exploring? Does
> it
> > make sense to commit some more resources to say XC or XL and try to
> improve
> > the quality of the product even further? To be honest, XL is in far far
> > better shape (haven't really tried XC in a while) and some more
> QA/polishing
> > can make it production ready much sooner.
>
> From my point of view, and EnterpriseDB's point of view, anything that
> doesn't go into the core PostgreSQL distribution isn't really getting
> us where we need to be.  If there's code in XL that would be valuable
> to merge into core PostgreSQL, then let's do it.  If the code cannot
> be used but there are lessons we can learn that will make what does go
> into core PostgreSQL better, let's learn them.  However, I don't think
> it's serving anybody very well that we have the XC fork, and multiple
> forks of the XC fork, floating around out there and people are working
> on those instead of working on core PostgreSQL.  The reality is that
> we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
> different projects and have all of them be good.  The reality is,
> also, that horizontal scalability isn't an optional feature.  There
> was a point in time at which the PostgreSQL project's official policy
> on replication was that it did not belong in core.  That was a bad
> policy; thankfully, it was reversed, and the result was Hot Standby
> and Streaming Replication, incredibly important technologies without
> which we would not be where we are today. Horizontal scalability is
> just as essential.
>

Agree with you, Robert.

One lesson we got from XL is that we need a testing framework for clusters,
so any cluster project should at least pass functional and performance
testing. XL was very easy to break and I'm wondering how many corner cases
still exist. We tried several other approaches, and while reading the
papers was fun, in practice we found many devilish details which made the
paper just a paper.



>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Tue, Sep  1, 2015 at 06:11:45PM -0400, Bruce Momjian wrote:
> Let me be clearer about what the Citus Data paper shows.  I said originally
> that the data was sent to the coordinator, sorted, then resent to the
> shards, but the document:
> 
>   https://goo.gl/vJWF85
>   
> https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
> 
> has the shards create the groups and the groups are sent to the other
> shards.  For example, to do COUNT(DISTINCT) if you have three shards,
> then each shard breaks its data into 3 buckets (1B in size), then the
> first bucket from each of the three shards goes to the first shard, and
> the second bucket goes to the second shard, etc.
> 
> Basically, they are doing map-reduce, and the shards are creating
> additional batches that get shipped to other shards.  I can see FDWs not
> working well in that case as you are really creating a new data layout
> just for the query.  This explains why the XC/XL people are saying they
> would use FDWs if they existed at the time they started development,
> while the Citus Data people are saying they couldn't use FDWs as they
> currently exist.  They probably both needed FDW improvements, but I
> think the Citus Data features would need a lot more.

To expand on this, using FDWs, it means each shard would create a
temporary table on the other shards and send some of its data to those
shards.  Once a shard gets all its data from the other shards, it will
process the data and send the result to the collector.
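A toy model of the repartitioning described above (illustrative only, not how Citus is implemented): each shard hash-partitions its values into one bucket per shard, bucket j from every shard is shipped to shard j, and after the exchange the local distinct counts can simply be summed.

```python
def repartition(shards, nbuckets):
    # Each shard splits its data into nbuckets by hash value.
    buckets = [[[] for _ in range(nbuckets)] for _ in shards]
    for i, rows in enumerate(shards):
        for v in rows:
            buckets[i][hash(v) % nbuckets].append(v)
    # Ship bucket j from every shard to shard j: after the exchange,
    # all occurrences of a value live on exactly one shard.
    return [[v for b in buckets for v in b[j]] for j in range(nbuckets)]

def count_distinct(shards):
    exchanged = repartition(shards, len(shards))
    # No value spans two partitions now, so local counts just add up.
    return sum(len(set(part)) for part in exchanged)

n = count_distinct([[1, 2, 2, 3], [3, 4], [4, 5, 1]])  # 5 distinct values
```

The `repartition` step is the new data layout created just for the query; it is this shuffle, not the final count, that the current FDW API has no way to express.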

That certainly seems like something FDWs would not do well.  Frankly, I
am unclear how Citus Data was able to do this with only backend hooks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 20:42, Amit Kapila wrote:

On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee  wrote:
 > On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote  wrote:
 >> On 2015-09-02 PM 06:41, Amit Langote wrote:
 >> Which, I now realize, is not the worry Amit Kapila expresses.



 >> The deadlock was *indeed detected* in this case, with all the locks
in the
 >> same PG instance. In a sharded environment with multiple PG instances,
 >> that becomes tricky. DLM (distributed lock manager/deadlock detector)
 >> seems indeed necessary as Amit K. suspects.


Ah, you are right.


 > Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.



I think that will be difficult for the application to decide, and it
needs to decide the same value for all statements, which is tricky because
different statements can take different amounts of time.  I think it is
better to have a solution for this problem; deadlocks should be detected.


+1

Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Kapila
On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee 
wrote:
> On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:
>>
>> On 2015-09-02 PM 06:41, Amit Langote wrote:
>>
>> Which, I now realize, is not the worry Amit Kapila expresses.
>>
>> The deadlock was *indeed detected* in this case, with all the locks in
the
>> same PG instance. In a sharded environment with multiple PG instances,
>> that becomes tricky. DLM (distributed lock manager/deadlock detector)
>> seems indeed necessary as Amit K. suspects.
>>
>
> Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.
>

I think that will be difficult for the application to decide, and it
needs to decide the same value for all statements, which is tricky because
different statements can take different amounts of time.  I think it is
better to have a solution for this problem; deadlocks should be detected.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Pavan Deolasee
On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote 
wrote:

> On 2015-09-02 PM 06:41, Amit Langote wrote:
> >
> > I think Albe may have a point here...
> >
> > Even inherited updates case appears to cause a deadlock if they are in
> > different queries. Demonstrated below:
> >
> > -- setup
> > CREATE TABLE t(a int);
> > CREATE TABLE t1() INHERITS(t);
> > CREATE TABLE t2() INHERITS(t);
> >
> > INSERT INTO t1 VALUES (1);
> > INSERT INTO t2 VALUES (2);
> >
> > -- in session 1
> > BEGIN;
> > UPDATE t SET a = a + 1 WHERE a = 1;
> > 
> >
> > -- in session 2
> > BEGIN;
> > UPDATE t SET a = a + 1 WHERE a = 2;
> > 
> >
> > -- back in session 1
> > UPDATE t SET a = a + 1 WHERE a = 2;
> > 
> >
> > -- back in session 2
> > UPDATE t SET a = a + 1 WHERE a = 1;
> > 
> >
>
> Which, I now realize, is not the worry Amit Kapila's expresses.
>
> The deadlock was *indeed detected* in this case, with all the locks in the
> same PG instance. In a sharded environment with multiple PG instances,
> that becomes tricky. DLM (distributed lock manager/deadlock detector)
> seems indeed necessary as Amit K. suspects.
>
>
Right. XC/XL did not address this issue and they rely on statement timeouts
to break distributed deadlocks.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 06:41, Amit Langote wrote:
> 
> I think Albe may have a point here...
> 
> Even inherited updates case appears to cause a deadlock if they are in
> different queries. Demonstrated below:
> 
> -- setup
> CREATE TABLE t(a int);
> CREATE TABLE t1() INHERITS(t);
> CREATE TABLE t2() INHERITS(t);
> 
> INSERT INTO t1 VALUES (1);
> INSERT INTO t2 VALUES (2);
> 
> -- in session 1
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 1;
> 
> 
> -- in session 2
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 2;
> 
> 
> -- back in session 1
> UPDATE t SET a = a + 1 WHERE a = 2;
> 
> 
> -- back in session 2
> UPDATE t SET a = a + 1 WHERE a = 1;
> 
> 

Which, I now realize, is not the worry Amit Kapila expresses.

The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. DLM (distributed lock manager/deadlock detector)
seems indeed necessary as Amit K. suspects.
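A minimal sketch of what such a DLM would do (illustrative only): merge the per-node wait-for edges and look for a cycle that no single node's local deadlock detector can see.

```python
def find_cycle(waits_for):
    # waits_for maps a transaction to the transaction it is blocked on.
    for start in waits_for:
        seen, txn = set(), start
        while txn in waits_for:
            if txn in seen:
                return sorted(seen)  # deadlock among the visited txns
            seen.add(txn)
            txn = waits_for[txn]
    return None

# Each node's local lock table sees only one edge, so neither detects
# a deadlock; only the merged graph exposes the cycle.
on_s1 = {"T2": "T1"}  # on shard S1, T2 waits for T1's row lock
on_s2 = {"T1": "T2"}  # on shard S2, T1 waits for T2's row lock
local = (find_cycle(on_s1), find_cycle(on_s2))   # (None, None)
merged = find_cycle({**on_s1, **on_s2})          # ["T1", "T2"]
```

The hard part in a real system is not the graph walk but collecting those edges consistently from all nodes, which is what a DLM would have to provide.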

Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Ashutosh Bapat
On Wed, Sep 2, 2015 at 12:49 AM, Josh Berkus  wrote:

> On 09/01/2015 11:36 AM, Tomas Vondra wrote:
> >> We want multiple copies of shards created by the sharding system itself.
> >>   Having a separate, and completely orthagonal, redundancy system to the
> >> sharding system is overly burdensome on the DBA and makes low-data-loss
> >> HA impossible.
> >
> > IMHO it'd be quite unfortunate if the design would make it impossible to
> > combine those two features (e.g. creating standbys for shards and
> > failing over to them).
> >
> > It's true that solving HA at the sharding level (by keeping multiple
> > copies of a each shard) may be simpler than combining sharding and
> > standbys, but I don't see why it makes low-data-loss HA impossible.
>
> Other way around, that is, having replication standbys as the only
> method of redundancy requires either high data loss or high latency for
> all writes.
>
> In the case of async rep, every time we fail over a node, the entire
> cluser would need to roll back to the last common known-good replay
> point, hence high data loss.
>
> In the case of sync rep, we are required to wait for at least double
> network lag time in order to do a single write ... making
> write-scalability quite difficult.
>
> Futher, if using replication the sharding system would have no way to
> (a) find out immediately if a copy was bad and (b) fail over quickly to
> a copy of the shard if the first requested copy was not responding.
> With async replication, we also can't use multiple copies of the same
> shard as a way to balance read workloads.
>
> If we write to multiple copies as a part of the sharding feature, then
> that can be parallelized, so that we are waiting only as long as the
> slowest write (or in failure cases, as long as the shard timeout).
> Further, we can check for shard-copy health and update shard
> availability data with each user request, so that the ability to see
> stale/bad data is minimized.
>

XC (and I guess XL and pgPool II as well) did this by firing the same DML
statement at all the copies after resolving any volatile references (e.g.
now()) in the DML, so that all the copies get the same values. That method,
however, needed a row identifier that can identify the same row on all the
replicas. The primary key is usually used as the row identifier, but not
all use cases which require shards to be replicated have a primary key in
their sharded tables.
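The volatile-resolution step described above can be sketched like this (illustrative Python; XC works on the parse tree, and the regex here is only a stand-in for that):

```python
import datetime
import re

def resolve_volatiles(sql):
    # Compute the volatile expression once on the coordinator and splice
    # the result in as a literal, so every copy applies the same value.
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return re.sub(r"\bnow\(\)", "'%s'" % now, sql)

def broadcast(sql, replicas):
    resolved = resolve_volatiles(sql)        # resolve once...
    return {r: resolved for r in replicas}   # ...send the same text everywhere

plans = broadcast("UPDATE t SET ts = now() WHERE id = 42", ["r1", "r2"])
# Every replica receives the identical literal timestamp, so the copies
# stay byte-for-byte the same.
```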


>
> There are obvious problems with multiplexing writes, which you can
> figure out if you knock pg_shard around a bit.  But I really think that
> solving those problems is the only way to go.
>
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of the
> sharding system if you're aiming for write scalability.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>
>
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 05:07, Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same
>>> order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's right,
> T1 and T2 could deadlock, but I think the assumption here is that
> transactions run over shards *in an inherited manner*.
> 

I think Albe may have a point here...

Even inherited updates case appears to cause a deadlock if they are in
different queries. Demonstrated below:

-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);

INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);

-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;


-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;


-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;


-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;



Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Albe Laurenz
Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>> Amit Langote wrote:
 On 2015-09-02 PM 03:25, Amit Kapila wrote:
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
>
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
>
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1
>>>
 As long as shards are processed in the same order in different
 transactions, ISTM, this issue should not arise? I can imagine it becoming
 a concern if parallel shard processing enters the scene. Am I missing
 something?
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same 
>>> order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's
> right, T1 and T2 could deadlock, but I think the assumption here is
> that transactions run over shards *in an inherited manner*.

Yes, but does every update affect all shards?

If I say "UPDATE t1 SET col = 1 WHERE id = 42" and the row with id 42
happens to be on shard S1, the update would only affect that shard, right?

Now if "UPDATE t2 SET col = 1 WHERE id = 42" would only take place on
shard S2, and two transactions issue both updates in different order,
one transaction would be waiting for a lock on shard S1, while the other
would be waiting for a lock on shard S2, right?

But maybe I'm missing something fundamental.

Yours,
Laurenz Albe



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 16:40, Amit Langote wrote:

On 2015-09-02 PM 04:07, Albe Laurenz wrote:

Amit Langote wrote:

On 2015-09-02 PM 03:25, Amit Kapila wrote:

Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?


That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.



Sorry, I failed to see why that would be the case. Could you elaborate?


I think Laurenz would assume that the updates 1. and 2. in the above 
transactions are performed *in a non-inherited manner*.  If that's 
right, T1 and T2 could deadlock, but I think the assumption here is 
that transactions run over shards *in an inherited manner*.


Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
> Amit Langote wrote:
>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>> Will it handle deadlocks across different table partitions. Consider
>>> a case as below:
>>>
>>> T1
>>> 1. Updates row R1 of T1 on shard S1
>>> 2. Updates row R2 of T2 on shard S2
>>>
>>> T2
>>> 1. Updates row R2 of T2 on shard S2
>>> 2. Updates row R1 of T1 on shard S1
> 
>> As long as shards are processed in the same order in different
>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>> a concern if parallel shard processing enters the scene. Am I missing
>> something?
> 
> That would only hold for a single query, right?
> 
> If 1. and 2. in the above example come from different queries within one
> transaction, you cannot guarantee that shards are processed in the same order.
> 
> So T1 and T2 could deadlock.
> 

Sorry, I failed to see why that would be the case. Could you elaborate?

Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Albe Laurenz
Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1

> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.

Yours,
Laurenz Albe



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 15:40, Amit Langote wrote:

On 2015-09-02 PM 03:25, Amit Kapila wrote:

On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 

The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.



Will it be able to do it for row-level locks?  Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?



Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?


Yeah, I thinks so, too.

Sorry, maybe my explanation above was not enough, but in the inherited 
UPDATEs/DELETEs the table modifications are also guaranteed to be done in 
the same order.  So, as Amit Langote said, both transactions will do the 
updates in the same order.
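The ordering guarantee described above can be illustrated with a toy lock-ordering rule (illustrative Python, not the planner's code): if every transaction acquires shard locks in one canonical order, two transactions touching the same shards cannot deadlock on those locks.

```python
def lock_order(shards):
    # Acquire locks in a canonical order (here: by OID), the way
    # inheritance_planner orders inheritance children, so transactions
    # touching the same shards never wait on each other in a cycle.
    return sorted(shards, key=lambda s: s["oid"])

t1 = lock_order([{"oid": 2, "name": "S2"}, {"oid": 1, "name": "S1"}])
t2 = lock_order([{"oid": 1, "name": "S1"}, {"oid": 2, "name": "S2"}])
# Both transactions lock S1 before S2, whatever order the query named them.
```

Note that this covers only the table-level locks taken up front; row-level locks are taken as rows are updated, which is Amit Kapila's remaining concern.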


Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Amit Langote
On 2015-09-02 PM 03:25, Amit Kapila wrote:
> On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 
>>
>> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
>> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
>> UPDATE/DELETE does, because inheritance_planner guarantees that all
>> backends lock inheritance children in the same order to avoid needless
>> deadlocks.
>>
>>
> Will it be able to do it for row-level locks?  Row-level locking occurs
> during the update of a row, so will it be possible to ensure the order of
> locks on rows?
> 
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
> 
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
> 
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1
> 

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?

Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Amit Kapila
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 
wrote:

> On 2015/09/02 14:28, Amit Langote wrote:
>
>> On 2015-09-02 PM 01:28, Amit Kapila wrote:
>>
>>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas  wrote:
>>>
>>>> I'm not averse to making the "connect to the remote nodes" part of
>>>> this solution use something other than the FDW infrastructure at some
>>>> point in time if somebody's prepared to build something better.  On
>>>> the other hand, I think it's extremely clear that the FDW
>>>> infrastructure has a large amount of potential upon which we have
>>>> thoroughly failed to capitalize.  Patches have already been written
>>>> for UPDATE/DELETE pushdown and for join pushdown.
>>>
>>> Will pushing down writes (Update/Delete) be sufficient to maintain sane
>>> locking behaviour and deadlock detection that can occur during writes on
>>> multiple shards?  For example it could easily be the case where a single
>>> Update statement could affect multiple shards and cause deadlock due to
>>> waits across the nodes.  Now unless we have some distributed lock manager
>>> or some other way to know the information of locks that happens across
>>> shards, it could be difficult to detect deadlocks.
>>
>> I wonder if Ashutosh's atomic foreign transactions patch would address any
>> issues inherent in such cases...
>
> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
> UPDATE/DELETE does, because inheritance_planner guarantees that all
> backends lock inheritance children in the same order to avoid needless
> deadlocks.
Will it be able to do that for row-level locks?  Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?

Will it handle deadlocks across different table partitions?  Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1
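For illustration, the usual remedy for this scenario is to acquire shard-level locks in one canonical order regardless of the order the statement names them in; a toy sketch (Python; the shard names are hypothetical, and in-process mutexes stand in for row locks held on remote nodes):

```python
import threading

# Hypothetical shard locks; in-process mutexes stand in for row locks
# held on remote nodes.
shard_locks = {"S1": threading.Lock(), "S2": threading.Lock()}

def run_transaction(shards_touched):
    # Acquire locks in one canonical (sorted) order, no matter which
    # order the statement names the shards in.  Two transactions can
    # then never wait on each other in a cycle.
    ordered = sorted(shards_touched)
    for shard in ordered:
        shard_locks[shard].acquire()
    try:
        pass  # ... perform the per-shard row updates here ...
    finally:
        for shard in reversed(ordered):
            shard_locks[shard].release()

# T1 touches S1 then S2; T2 touches S2 then S1.  With canonical
# ordering both acquire S1 before S2, so neither can deadlock.
t1 = threading.Thread(target=run_transaction, args=(["S1", "S2"],))
t2 = threading.Thread(target=run_transaction, args=(["S2", "S1"],))
t1.start(); t2.start()
t1.join(); t2.join()
```

Without the sort, interleaving T1 and T2 exactly as listed above can leave each holding one lock while waiting for the other.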



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Etsuro Fujita

On 2015/09/02 14:28, Amit Langote wrote:

On 2015-09-02 PM 01:28, Amit Kapila wrote:

On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas  wrote:

I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better.  On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize.  Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.



Will pushing down writes (Update/Delete) be sufficient to maintain sane
locking behaviour and deadlock detection that can occur during writes on
multiple shards?  For example it could easily be the case where a single
Update statement could affect multiple shards and cause deadlock due to
waits across the nodes.  Now unless we have some distributed lock manager
or some other way to know the information of locks that happens across
shards, it could be difficult to detect deadlocks.



I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...


The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane 
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down 
UPDATE/DELETE does, because inheritance_planner guarantees that all 
backends lock inheritance children in the same order to avoid needless 
deadlocks.


Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Amit Langote
On 2015-09-02 PM 01:28, Amit Kapila wrote:
>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas  wrote:
>>>
>>> I'm not averse to making the "connect to the remote nodes" part of
>>> this solution use something other than the FDW infrastructure at some
>>> point in time if somebody's prepared to build something better.  On
>>> the other hand, I think it's extremely clear that the FDW
>>> infrastructure has a large amount of potential upon which we have
>>> thoroughly failed to capitalize.  Patches have already been written
>>> for UPDATE/DELETE pushdown and for join pushdown.
> 
> Will pushing down writes (Update/Delete) be sufficient to maintain sane
> locking behaviour and deadlock detection that can occur during writes on
> multiple shards?  For example it could easily be the case where a single
> Update statement could affect multiple shards and cause deadlock due to
> waits across the nodes.  Now unless we have some distributed lock manager
> or some other way to know the information of locks that happens across
> shards, it could be difficult to detect deadlocks.
> 

I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...

Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Amit Kapila
On Tue, Sep 1, 2015 at 4:25 PM, Bruce Momjian  wrote:
>
>
> The document opens a big question --- when queries can't be processed in
> a traditional top/down fashion, Citus has the goal of sending groups of
> results up to the coordinator, reordering them, then sending them back
> to the shards for further processing, basically using the shards as
> compute engines because the shards are no longer using local data to do
> their computations.  The two examples they give are COUNT(DISTINCT) and
> a join across two sharded tables ("CANADA").
>
> I assumed these queries were going to be solved by sending as digested
> data as possible to the coordinator, and having the coordinator complete
> any remaining processing.  I think we are going to need to decide if
> such "sending data back to shards" is something we are ever going to
> implement.  I can see FDWs _not_ working well for that use-case.
>

Here one related point to think about is how we envision handling statement
requests: do we want a centralized coordinator that processes all requests,
or can requests be received by any node?
I think both kinds of systems have their own pros and cons: a centralized
coordinator might be limited by the number of simultaneous requests it can
handle, whereas allowing requests to be processed by each individual node
means we have to think about replicating all the metadata on all nodes.
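One common way to let any node accept requests is to make routing a pure function of small, fully replicated metadata, as consistent-hashing systems do; a toy sketch (Python; the node names and routing table are hypothetical):

```python
import hashlib

# Toy routing table: replicated metadata mapping a hash space to nodes.
# If every node holds a copy of this table, any node can route a
# statement without consulting a central coordinator.
NODES = ["node0", "node1", "node2"]

def route(distribution_key: str) -> str:
    # Hash the distribution key and pick a node deterministically;
    # every node computes the same answer from the same metadata.
    h = int(hashlib.sha256(distribution_key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# The same key always routes to the same node, from any entry point.
assert route("customer-42") == route("customer-42")
```

The trade-off named above remains: the table must be kept in sync on every node, which is cheap for routing metadata but harder for statistics.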

I think collecting statistics about different objects is another thing that
can differ depending on the strategy we choose for handling requests.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Amit Kapila
> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas  wrote:
> >
> > On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund  wrote:
> > > On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > >> Uh, we already have a list of things we need to add to FDWs to make
> > >> them work, and Citus Data has provided a document of more things that
> > >> are needed, https://goo.gl/vJWF85.  I am not sure how much bigger a
> > >> red flag you want to confirm that everyone agrees that major FDW
> > >> improvements are a requirement for this.
> > >
> > > Several people saying that the FDW infrastructure isn't sufficient
> > > right now is pretty far from implying that all of them agree that the
> > > FDW API is the way to go.
> > >
> > > I'm not sure myself. If it works out it's going to save us some work
> > > and make it more realistic to get there sometime not too far off. But
> > > I'm afraid that the resulting system will feel like our current
> > > partitioning implementation. Yes, it kinda works, but it's hard to get
> > > started, it doesn't support too many features and you're kind of
> > > afraid your relatives will see what you've done.
> >
> > I'm not averse to making the "connect to the remote nodes" part of
> > this solution use something other than the FDW infrastructure at some
> > point in time if somebody's prepared to build something better.  On
> > the other hand, I think it's extremely clear that the FDW
> > infrastructure has a large amount of potential upon which we have
> > thoroughly failed to capitalize.  Patches have already been written
> > for UPDATE/DELETE pushdown and for join pushdown.

Will pushing down writes (Update/Delete) be sufficient to maintain sane
locking behaviour and deadlock detection that can occur during writes on
multiple shards?  For example it could easily be the case where a single
Update statement could affect multiple shards and cause deadlock due to
waits across the nodes.  Now unless we have some distributed lock manager
or some other way to know the information of locks that happens across
shards, it could be difficult to detect deadlocks.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Petr Jelinek
On 2015-09-02 00:09, Josh Berkus wrote:
> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>> For example assume that there are just two shards in two separate data
>> centers, connected by a link with limited bandwidth. Now, let's assume
>> you always keep a local replica for failover. So you have A1+A2 in DC1,
>> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
>> have to write data to B2 and wait for it. So either you send the data to
>> each node separately (consuming 2x the bandwidth), or send it to B1 and
>> let it propagate to B2 e.g. through sync rep.
>>
>> So while you may be right in single-DC deployments, with multi-DC
>> deployments the situation is quite different - not only that the network
>> bandwidth is not unlimited, but because latencies within DC may be a
>> fraction of latencies between the locations (to the extent that the
>> increase due to syncrep may be just noise). So the local replication may
>> be actually way faster.
>
> I'm not seeing how the above is better using syncrep than using shard
> copying?

Shard copying usually assumes that the origin node does the copy - the
data has to go twice through the slow connection. With replication you
can replicate locally over fast connection.
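A back-of-the-envelope sketch of that bandwidth point (Python; the numbers are purely illustrative):

```python
# Illustrative only: bytes crossing the slow inter-DC link for one
# 1 MB write originating in DC1 against a shard whose two copies
# (B1, B2) both live in DC2.
write_size = 1_000_000  # bytes

# Origin-driven copying: DC1 sends the row to B1 and separately to B2,
# so the payload crosses the WAN twice.
origin_copies_wan_bytes = 2 * write_size

# Chained (local) replication: DC1 sends once to B1; the B1 -> B2 hop
# stays inside DC2 on the fast local network.
chained_wan_bytes = 1 * write_size

print(origin_copies_wan_bytes // chained_wan_bytes)  # -> 2
```

The latency argument is analogous: the intra-DC hop typically adds noise-level delay compared with the inter-DC round trip.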

>
>> I can imagine forwarding the data between B1 and B2 even with a purely
>> sharding solution, but at that point you effectively re-implemented
>> syncrep.
>
> Not really, the mechanism is different and the behavior is different.
> One critical deficiency in using binary syncrep is that you can't do
> round-robin redundancy at all; every redundant node has to be an exact
> mirror of another node.  In a good HA distributed system, you want
> multiple shards per node, and you want each shard to be replicated to a
> different node, so that in the event of node failure you're not dumping
> the full load on one other server.
>

This assumes that we use binary replication, but we can reasonably use
logical replication which can quite easily do filtering of what's
replicated where.

>> IMHO the design has to address the multi-DC setups somehow. I think that
>> many of the customers who are so concerned about scaling to many shards
>> are also concerned about availability in case of DC outages, no?
>
> Certainly.  But users located in a single DC shouldn't pay the same
> overhead as users who are geographically spread.
>

Agreed, so we should support both ways, but I don't think it's necessary
to support both ways in version 0.1. It's just important to not paint
ourselves into a corner with design decisions that would make one of the
ways impossible.


>>> If we write to multiple copies as a part of the sharding feature,
>>> then that can be parallelized, so that we are waiting only as long as
>>> the slowest write (or in failure cases, as long as the shard
>>> timeout). Further, we can check for shard-copy health and update
>>> shard availability data with each user request, so that the ability
>>> to see stale/bad data is minimized.
>>
>> Again, this assumes infinite network bandwidth.
>
> In what way is the total network bandwidth used in the system different
> for shard copying than for sync replication?
>

Again, when shards are distributed over multiple DCs (or actually even
multiple racks) the bandwidth and latency of local copy will be much
better than the one of the remote copy, so the local replication can have
much lower impact on the cluster performance than remote shard copy will.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Peter Geoghegan
On Tue, Sep 1, 2015 at 10:17 AM, Robert Haas  wrote:
> On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus  wrote:
>> You're assuming that our primary bottleneck for writes is IO.  It's not
>> at present for most users, and it certainly won't be in the future.  You
>> need to move your thinking on systems resources into the 21st century,
>> instead of solving the resource problems from 15 years ago.
>
> Your experience doesn't match mine.  I find that it's frequently
> impossible to get the system to use all of the available CPU capacity,
> either because you're bottlenecked on locks or because you are
> bottlenecked on the  I/O subsystem, and with the locking improvements
> in newer versions, the former is becoming less and less common.

I think you're both right. I think that we need to fix the buffer
manager, to make its caching algorithm smarter. Since we're mostly
using the filesystem cache, this is particularly important for
PostgreSQL. We need to remember usage information for evicted blocks
for some period of time afterwards. This is largely a problem for
Postgres in particular, I suspect.
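Remembering usage information for evicted blocks is the core idea of the "ghost list" used by caching algorithms such as 2Q and ARC; a stripped-down sketch (Python, illustrative only, not the PostgreSQL buffer manager):

```python
from collections import OrderedDict

class GhostLRU:
    """LRU cache that remembers the keys of recently evicted blocks
    (a 'ghost list') so that a re-reference shortly after eviction can
    be recognized as a sign the block is hotter than plain LRU assumed."""

    def __init__(self, capacity, ghost_capacity):
        self.capacity = capacity
        self.cache = OrderedDict()        # key -> block contents
        self.ghosts = OrderedDict()       # key -> None (metadata only)
        self.ghost_capacity = ghost_capacity
        self.ghost_hits = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as recently used
            return self.cache[key]
        if key in self.ghosts:
            self.ghost_hits += 1          # evicted, then wanted again
            del self.ghosts[key]
        return None

    def put(self, key, block):
        self.cache[key] = block
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            evicted_key, _ = self.cache.popitem(last=False)
            self.ghosts[evicted_key] = None   # remember it was here
            if len(self.ghosts) > self.ghost_capacity:
                self.ghosts.popitem(last=False)

cache = GhostLRU(capacity=2, ghost_capacity=4)
cache.put("blk1", b"...")
cache.put("blk2", b"...")
cache.put("blk3", b"...")      # evicts blk1 into the ghost list
cache.get("blk1")              # miss, but counted as a ghost hit
print(cache.ghost_hits)        # -> 1
```

A real eviction policy would use the ghost-hit signal to adapt (e.g. favoring frequency over recency); the sketch only records it.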

At the same time, I agree with Josh's assessment that long-term, we
are going to have the biggest problem with memory latency and memory
bandwidth, which are usually considered facets of CPU performance, and
with internal lock contention. Addressing the memory access bottleneck
dovetails with parallelism, in that it must be considered alongside
parallelism. Josh's "21st century" remark seems quite justified to me.
For a further example of this, check out my latest progress with
external sorting, which I plan to post later in the week. I/O isn't
the big problem there at all, and I now think we can make external
sorts close to internal sorts in performance across the board.

I imagine that Josh's experience is based on workloads that mostly fit
in shared_buffers, so I can see why you'd disagree if that was
something you've seen less of. I'll quote Hellerstein and Stonebraker
in 2007 [1]:

"Copying data in memory can be a serious bottleneck. Copies contribute
latency, consume CPU cycles, and can flood the CPU data cache. This fact is
often a surprise to people who have not operated or implemented a
database system, and assume that main-memory operations are “free”
compared to disk I/O. But in practice, throughput in a well-tuned
transaction processing DBMS is typically not I/O-bound. This is
achieved in high-end installations by purchasing sufficient disks and
RAM so that repeated page requests are absorbed by the buffer pool,
and disk I/Os are shared across the disk arms at a rate that can feed
the data appetite of all the processors in the system."

>> Our real future bottlenecks are:
>>
>> * ability to handle more than a few hundred connections
>> * locking limits on the scalability of writes
>> * ability to manage large RAM and data caches
>
> I do agree that all of those things are problems, FWIW.

These seem like our long term problems, to me.

[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf, page 213
-- 
Peter Geoghegan




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2015 at 12:40:40PM -0400, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian  wrote:
> > I assumed these queries were going to be solved by sending as digested
> > data as possible to the coordinator, and having the coordinator complete
> > any remaining processing.  I think we are going to need to decide if
> > such "sending data back to shards" is something we are ever going to
> > implement.  I can see FDWs _not_ working well for that use-case.
> 
> I do think we are going to want to support that.  All the people I've
> talked to about parallel and distributed query processing agree that
> you need to do that sort of thing to get really good and scalable
> performance.  I think that we could make a lot of headway as compared
> with the status quo just by implementing more pushdown optimizations
> than we have today.  Right now, SELECT COUNT(*) FROM table will suck
> back the whole remote table and count the rows locally, and that's
> stupid.  We can fix that case with better pushdown logic.  We can also
> fix the case of N-way join nests where the joins are either on the
> partitioning key or to replicated tables.  But suppose you have a join
> between two tables which are sharded across the cluster but not on the
> partitioning key.  There's no way to push the join down, so all the
> work comes back to the coordinator, which is possibly OK if such
> queries are rare, but not so hot if they are frequent.

Let me clearer about what the Citus Data paper shows.  I said originally
that the data was sent to the coordinator, sorted, then resent to the
shards, but the document:

https://goo.gl/vJWF85

https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

has the shards create the groups and the groups are sent to the other
shards.  For example, to do COUNT(DISTINCT) if you have three shards,
then each shard breaks its data into 3 buckets (1B in size), then the
first bucket from each of the three shards goes to the first shard, and
the second bucket goes to the second shard, etc.

Basically, they are doing map-reduce, and the shards are creating
additional batches that get shipped to other shards.  I can see FDWs not
working well in that case as you are really creating a new data layout
just for the query.  This explains why the XC/XL people are saying they
would use FDWs if they existed at the time they started development,
while the Citus Data people are saying they couldn't use FDWs as they
currently exist.  They probably both needed FDW improvements, but I
think the Citus Data features would need a lot more.
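The bucket exchange described above can be sketched in miniature (Python; three in-memory lists stand in for shards, and `value % n` stands in for the hash partitioning):

```python
# Toy repartitioned COUNT(DISTINCT) over three "shards".  Each shard
# hashes its values into one bucket per shard, buckets are shipped so
# that bucket i from every shard lands on shard i, and each shard then
# counts distinct values locally.  No single node ever sees all rows.
shards = [
    [1, 2, 3, 4],
    [3, 4, 5, 6],
    [6, 7, 8, 1],
]
n = len(shards)

# Map phase: each shard partitions its values by hash (here: value % n).
buckets = [[set() for _ in range(n)] for _ in range(n)]
for s, rows in enumerate(shards):
    for v in rows:
        buckets[s][v % n].add(v)

# Shuffle + reduce: shard i receives bucket i from every shard and
# counts distinct values there; because the buckets partition the hash
# space, the per-bucket counts are simply additive.
total = sum(len(set().union(*(buckets[s][i] for s in range(n))))
            for i in range(n))

print(total)  # -> 8 (distinct values 1..8)
```

This is exactly the shape of query for which a plain "pull rows to the coordinator" FDW strategy creates a new data layout per query, as noted above.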

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Josh Berkus
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
> Hi,
> 
> On 09/01/2015 09:19 PM, Josh Berkus wrote:
>> Other way around, that is, having replication standbys as the only
>> method of redundancy requires either high data loss or high latency
>> for all writes.
> 
> I haven't said that. I said that we should allow that topology, not that
> it should be the only method of redundancy.

Ah, OK, I didn't understand you.  Of course I'm in favor of supporting
both methods of redundancy if we can.

>> In the case of sync rep, we are required to wait for at least double
>>  network lag time in order to do a single write ... making
>> write-scalability quite difficult.
> 
> Which assumes that latency (or rather the increase due to syncrep) is a
> problem for the use case. Which may be the case for many use cases, but
> certainly is not a problem for many BI/DWH use cases performing mostly
> large batch loads. In those cases the network bandwidth may be quite
> important resource.

I'll argue that BI/DW is the least interesting use case for mainstream
PostgreSQL because there are production-quality forks which do this
(mostly proprietary, but we can work on that).  We really need a solution
which works for OLTP.

> For example assume that there are just two shards in two separate data
> centers, connected by a link with limited bandwidth. Now, let's assume
> you always keep a local replica for failover. So you have A1+A2 in DC1,
> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
> have to write data to B2 and wait for it. So either you send the data to
> each node separately (consuming 2x the bandwidth), or send it to B1 and
> let it propagate to B2 e.g. through sync rep.
> 
> So while you may be right in single-DC deployments, with multi-DC
> deployments the situation is quite different - not only that the network
> bandwidth is not unlimited, but because latencies within DC may be a
> fraction of latencies between the locations (to the extent that the
> increase due to syncrep may be just noise). So the local replication may
> be actually way faster.

I'm not seeing how the above is better using syncrep than using shard
copying?

> I can imagine forwarding the data between B1 and B2 even with a purely
> sharding solution, but at that point you effectively re-implemented
> syncrep.

Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node.  In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.

> IMHO the design has to address the multi-DC setups somehow. I think that
> many of the customers who are so concerned about scaling to many shards
> are also concerned about availability in case of DC outages, no?

Certainly.  But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.

> I don't follow. With sync rep we do know whether the copy is OK or not,
> because the node either confirms writes or not. The failover certainly
> is more complicated and is not immediate (to the extent of keeping a
> copy at the sharding level), but it's a question of trade-offs.
> 
> It's true we don't have auto-failover solution at the moment, but as I
> said - I can easily imagine most people using just sharding, while some
> deployments use syncrep with manual failover.

As long as direct shard copying is available, I'm happy.  I have no
complaints about additional mechanisms.

I'm bringing this up because the FDW proposal made at pgCon did not
include *any* mechanism for HA/redundancy, just some handwaving about
replication and/or BDR. This was one of the critical design failures of
PostgresXC.  A multinode system without automated node failover and
replacement is a low-availability system.

>> If we write to multiple copies as a part of the sharding feature,
>> then that can be parallelized, so that we are waiting only as long as
>> the slowest write (or in failure cases, as long as the shard
>> timeout). Further, we can check for shard-copy health and update
>> shard availability data with each user request, so that the ability
>> to see stale/bad data is minimized.
> 
> Again, this assumes infinite network bandwidth.

In what way is the total network bandwidth used in the system different
for shard copying than for sync replication?
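The parallel multi-copy write quoted above, waiting only as long as the slowest copy (or the shard timeout), can be sketched with a thread pool (Python; `time.sleep` stands in for network round trips, and the latencies are made up):

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time

# Hypothetical per-copy write; time.sleep stands in for the network
# round trip to the node holding that shard copy.
def write_copy(copy_name, latency_s):
    time.sleep(latency_s)
    return copy_name

# Made-up per-copy latencies in seconds.
copies = {"copyA": 0.2, "copyB": 0.4, "copyC": 0.1}

start = time.monotonic()
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(write_copy, name, lat)
               for name, lat in copies.items()]
    # Wait for all copies, bounded by a (hypothetical) shard timeout.
    done, not_done = wait(futures, timeout=5.0)
elapsed = time.monotonic() - start

# Parallel fan-out waits roughly as long as the slowest copy, not the
# sum of all copies.
assert len(not_done) == 0
assert elapsed < sum(copies.values())
```

Copies still in `not_done` after the timeout would be the ones to mark unhealthy in the shard-availability metadata.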

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread David E. Wheeler
On Sep 1, 2015, at 1:47 PM, Robert Haas  wrote:

> Admittedly, there are some problems with snapshots here: if you don't
> do anything special about snapshots, then what you have here will be
> "eventually consistent" behavior.  But that might be suitable for some
> environments, such as a very loosely coupled system where not all nodes
> are connected all the time.

Given that we’re discussing multi-node architectures here, you should expect 
that not all nodes will be connected at any time. Nodes fail, but the cluster 
should not.

> And, for those environments where you do
> need consistent snapshots, we can imagine ways to get that behavior,
> like having the GTM consider the transaction uncommitted until it's
> been logically replicated to every node.

Again, you need a way to deal with nodes going down. I can envision building a 
cluster with twelve nodes replicated to each of three 
geographically-distributed data centers. Each replication/sync model needs to 
be able to handle nodes going up and down, data centers or racks going up or 
down, and nodes being added and removed.

But even with smaller clusters, there’s no way around the fact that no system 
can guarantee that all nodes will be available at all times.

Best,

David





Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2015 at 08:18:38AM -0700, Joshua Drake wrote:
> On 09/01/2015 02:48 AM, Bruce Momjian wrote:
> >On Tue, Sep  1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
> 
> >There is no question that using XC/XL will get us to a usable solution
> >faster, but see my recent post to Josh Berkus --- the additional code
> >will be so burdensome that I doubt it would be accepted. If it was, I
> >bet we would have considered it long ago.
> >
> >I think the only way we are going to get sharding into Postgres is to do
> >it in a way that enhances existing Postgres capabilities.
> 
> So that we have XL again?

Kind of.  If XC/XL used FDWs I think we would try to use their code
first.  The issue is that FDWs didn't exist at the time.  I would say
our first approach might be doing XC/XL again with FDWs.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Tomas Vondra

Hi,

On 09/01/2015 09:19 PM, Josh Berkus wrote:

On 09/01/2015 11:36 AM, Tomas Vondra wrote:

We want multiple copies of shards created by the sharding system
itself. Having a separate, and completely orthogonal, redundancy
system to the sharding system is overly burdensome on the DBA and
makes low-data-loss HA impossible.


IMHO it'd be quite unfortunate if the design would make it
impossible to combine those two features (e.g. creating standbys
for shards and failing over to them).

It's true that solving HA at the sharding level (by keeping
multiple copies of a each shard) may be simpler than combining
sharding and standbys, but I don't see why it makes low-data-loss
HA impossible.


Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency
for all writes.


I haven't said that. I said that we should allow that topology, not that 
it should be the only method of redundancy.




In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.

In the case of sync rep, we are required to wait for at least double
 network lag time in order to do a single write ... making
write-scalability quite difficult.


Which assumes that latency (or rather the increase due to syncrep) is a 
problem for the use case. Which may be the case for many use cases, but 
certainly is not a problem for many BI/DWH use cases performing mostly 
large batch loads. In those cases the network bandwidth may be quite 
important resource.


For example assume that there are just two shards in two separate data 
centers, connected by a link with limited bandwidth. Now, let's assume 
you always keep a local replica for failover. So you have A1+A2 in DC1, 
B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also 
have to write data to B2 and wait for it. So either you send the data to 
each node separately (consuming 2x the bandwidth), or send it to B1 and 
let it propagate to B2 e.g. through sync rep.


So while you may be right in single-DC deployments, with multi-DC 
deployments the situation is quite different - not only that the network 
bandwidth is not unlimited, but because latencies within DC may be a 
fraction of latencies between the locations (to the extent that the 
increase due to syncrep may be just noise). So the local replication may 
be actually way faster.


I can imagine forwarding the data between B1 and B2 even with a purely 
sharding solution, but at that point you effectively re-implemented syncrep.


IMHO the design has to address the multi-DC setups somehow. I think that 
many of the customers who are so concerned about scaling to many shards 
are also concerned about availability in case of DC outages, no?


We should also consider support for custom topologies (not just a full 
mesh, or whatever we choose as the default/initial topology), which is 
somehow related.




Further, if using replication the sharding system would have no way
to (a) find out immediately if a copy was bad and (b) fail over
quickly to a copy of the shard if the first requested copy was not
responding. With async replication, we also can't use multiple copies
of the same shard as a way to balance read workloads.


I don't follow. With sync rep we do know whether the copy is OK or not, 
because the node either confirms writes or not. The failover certainly 
is more complicated and is not immediate (to the extent of keeping a 
copy at the sharding level), but it's a question of trade-offs.


It's true we don't have auto-failover solution at the moment, but as I 
said - I can easily imagine most people using just sharding, while some 
deployments use syncrep with manual failover.




If we write to multiple copies as a part of the sharding feature,
then that can be parallelized, so that we are waiting only as long as
the slowest write (or in failure cases, as long as the shard
timeout). Further, we can check for shard-copy health and update
shard availability data with each user request, so that the ability
to see stale/bad data is minimized.


Again, this assumes infinite network bandwidth.



There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit. But I really think
that solving those problems is the only way to go.

Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of
the sharding system if you're aiming for write scalability.


I haven't mentioned BDR at all, and given the async nature I don't have 
a clear idea of how it fits into the sharding world at this point.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 3:19 PM, Josh Berkus  wrote:
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of the
> sharding system if you're aiming for write scalability.

I mostly agree, but keep in mind that you don't have to replicate
everything.  A lot of people might find it useful to replicate the
data that is common to all nodes even if they don't replicate the
sharded data.

Admittedly, there are some problems with snapshots here: if you don't
do anything special about snapshots, then what you have here will be
"eventually consistent" behavior.  But that might be suitable for some
environments, such as very loosely coupled system where not all nodes
are connected all the time.  And, for those environments where you do
need consistent snapshots, we can imagine ways to get that behavior,
like having the GTM consider the transaction uncommitted until it's
been logically replicated to every node.
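As a toy illustration of that idea (hypothetical names, nothing to do
with actual GTM internals): a transaction stays invisible to snapshots
until every node has acknowledged logical replication of its commit.

```python
# Sketch of a GTM-style commit gate: a transaction becomes visible only
# once all nodes have confirmed replay, giving consistent snapshots.
class CommitGate:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.acks = {}  # txid -> set of nodes that confirmed replay

    def node_acked(self, txid, node):
        self.acks.setdefault(txid, set()).add(node)

    def is_visible(self, txid):
        # Snapshots may include the transaction only after all nodes ack.
        return self.acks.get(txid, set()) >= self.nodes

gate = CommitGate(["node1", "node2", "node3"])
gate.node_acked(42, "node1")
assert not gate.is_visible(42)
gate.node_acked(42, "node2")
gate.node_acked(42, "node3")
assert gate.is_visible(42)
```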

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Josh Berkus
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>> We want multiple copies of shards created by the sharding system itself.
>>   Having a separate, and completely orthogonal, redundancy system to the
>> sharding system is overly burdensome on the DBA and makes low-data-loss
>> HA impossible.
> 
> IMHO it'd be quite unfortunate if the design would make it impossible to
> combine those two features (e.g. creating standbys for shards and
> failing over to them).
> 
> It's true that solving HA at the sharding level (by keeping multiple
> copies of each shard) may be simpler than combining sharding and
> standbys, but I don't see why it makes low-data-loss HA impossible.

Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.

In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.
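To make that rollback concrete, here is a simplified sketch (LSNs are
plain integers and the standby names are made up): the only replay point
consistent across all shard standbys is the minimum LSN any of them has
applied, so every write after that point is lost on failover.

```python
# Illustrative sketch of why async-replication failover loses data: the
# cluster must roll back to the last replay point applied on *every*
# standby, discarding all later writes. LSNs simplified to integers.

def common_rollback_point(replay_lsns):
    """Last replay position known to be applied on every standby."""
    return min(replay_lsns.values())

standbys = {"shard1-replica": 1050, "shard2-replica": 990, "shard3-replica": 1200}
lost_from = common_rollback_point(standbys)
# The cluster rolls back to LSN 990; shard3 discards 210 LSNs of progress.
assert lost_from == 990
```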

In the case of sync rep, we are required to wait at least double the
network lag time for every single write ... making write scalability
quite difficult.

Further, if using replication, the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.

If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the ability to see
stale/bad data is minimized.
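The multiplexed write described above can be sketched as follows (a toy
illustration: write_copy() and the node names are stand-ins, not a real
API). All copies of a shard are written in parallel, so total latency is
bounded by the slowest copy (or the timeout), and copies that fail are
recorded so shard-availability data can be updated.

```python
# Sketch of parallel writes to all copies of a shard: wait only as long
# as the slowest write, and track failed copies for health bookkeeping.
from concurrent.futures import ThreadPoolExecutor, as_completed

def write_copy(node, row):
    # Stand-in for a network write to one shard copy.
    if node == "copy-b":
        raise TimeoutError("no response")
    return node

def write_shard(copies, row, timeout=1.0):
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=len(copies)) as pool:
        futures = {pool.submit(write_copy, n, row): n for n in copies}
        for fut in as_completed(futures, timeout=timeout):
            node = futures[fut]
            try:
                fut.result()
                ok.append(node)
            except Exception:
                failed.append(node)  # mark copy unhealthy, route around it
    return ok, failed

ok, failed = write_shard(["copy-a", "copy-b", "copy-c"], {"id": 1})
assert failed == ["copy-b"] and sorted(ok) == ["copy-a", "copy-c"]
```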

There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit.  But I really think that
solving those problems is the only way to go.

Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Tomas Vondra

Hi,

On 09/01/2015 08:22 PM, Andres Freund wrote:

On 2015-09-01 14:11:21 -0400, Robert Haas wrote:

On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
 wrote:

Memory bandwidth, for example. It's quite difficult to spot, because the
intuition is that memory is fast, but thanks to improvements in storage (and
stagnation in RAM bandwidth), this is becoming a significant issue.


I'd appreciate any tips on how to spot problems of this type.  But
it's my impression that perf, top, vmstat, and other Linux performance
tools will count time spent waiting for memory as CPU time, not idle
time.  If that's correct, that wouldn't explain workloads where CPU
utilization doesn't reach 100%.  Rather, it would show up as CPU time
hitting 100% while tps remains low.


Yea.

-e bus-cycles is a good start to measure where bus traffic is
  relevant. Depending on the individual cpu other events can be helpful.


long-story: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

It's from 2007 and only explains oprofile (chapter 7), which has mostly 
been abandoned in favor of perf nowadays. Perf can produce similar 
stats, so the discussion is still valid. It also covers cachegrind (a 
valgrind tool).


perf examples: http://www.brendangregg.com/perf.html

Most of the examples with "CPU" in the comment are relevant. Usually 
"perf stat" and "perf stat -d" are good starting points - once you see a 
lot of LLC misses or too few instructions per cycle (low IPC), it's a 
sign of memory bandwidth problems.
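For example (an illustrative sketch; the thresholds are made up, not
authoritative): given the counters "perf stat -d" reports, low IPC
combined with a high LLC miss ratio hints at memory-bandwidth pressure.

```python
# Heuristic from the discussion above: compute IPC and LLC miss ratio
# from perf-stat-style counters and flag likely memory-bandwidth issues.
def memory_bound_hints(cycles, instructions, llc_loads, llc_misses):
    ipc = instructions / cycles
    miss_ratio = llc_misses / llc_loads
    hints = []
    if ipc < 1.0:          # few instructions retired per cycle: stalled
        hints.append("low IPC")
    if miss_ratio > 0.2:   # many last-level-cache misses
        hints.append("high LLC miss ratio")
    return ipc, miss_ratio, hints

ipc, ratio, hints = memory_bound_hints(
    cycles=10_000_000, instructions=6_000_000,
    llc_loads=500_000, llc_misses=200_000)
assert hints == ["low IPC", "high LLC miss ratio"]
```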


Sadly, this is partially caused by our volcano-style executor, and 
sharding alone can do nothing about that.





Process-management overhead is another thing we tend to ignore, but once you
get to many processes all willing to work at the same time, you need to
account for that.


Any tips on spotting problems in that area?


Not perfect, but -e context-switches (general context switches) and -e
syscalls:sys_enter_semop (for postgres enforced context switches) is
rather useful when combined with --call-graph dwarf ('fp' sometimes
doesn't see through libc which is most of the time not compiled with
-fno-omit-frame-pointer).


Right, this is about the best I'm aware of.

The problem often is not the number of context switches itself, but the 
fact that all the processes share the same (very limited) CPU caches 
(L1/L2/L3). Each process dirties the caches for the other processes, 
lowering the hit ratios. Which can be spotted using the commands above.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Tomas Vondra

Hi,

On 08/31/2015 10:16 PM, Josh Berkus wrote:

It's also important to recognize that there are three major use-cases
for write-scalable clustering:

* OLTP: small-medium cluster, absolute ACID consistency,
   bottlenecked on small writes per second
* DW: small-large cluster, ACID optional,
   bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional,
   bottlenecked on # of connections

We cannot possibly solve all of the above at once, but to the extent
that we recognize all 3 use cases, we can build core features which
can be adapted to all of them.


It would be good to have a discussion about use-cases first - each of us 
is mostly concerned about the use cases they're dealing with, with 
bottlenecks specific to their environment. These three basic use-cases 
seem like a good start, but some of the details certainly don't match my 
experience ...


For example I can't see how ACID can be optional for the DWH use-case, 
but maybe there's a good explanation - I can imagine sacrificing various 
ACID properties at the node level, but I can't really imagine 
sacrificing any of the ACID properties for the cluster as a whole. So 
this would deserve some explanation.


I also don't share the view that write scalability is the only (or even 
main) issue, that we should aim to solve. For the business-intelligence 
use cases I've been working on recently, handling complex read-only 
ad-hoc queries is often much more important. And in those cases the 
bottleneck is often CPU and/or RAM.




I'm also going to pontificate that, for a future solution, we should
not focus on write *IO*, but rather on CPU and RAM. The reason for
this thinking is that, with the latest improvements in hardware and
9.5 improvements, it's increasingly rare for machines to be
bottlenecked on writes to the transaction log (or the heap). This has
some implications for system design. For example, solutions which
require all connections to go through a single master node do not
scale sufficiently to be worth bothering with.


+1


On some other questions from Mason:


Do we want multiple copies of shards, like the pg_shard approach? Or
keep things simpler and leave it up to the DBA to add standbys?


We want multiple copies of shards created by the sharding system itself.
  Having a separate, and completely orthogonal, redundancy system to the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.


IMHO it'd be quite unfortunate if the design would make it impossible to 
combine those two features (e.g. creating standbys for shards and 
failing over to them).


It's true that solving HA at the sharding level (by keeping multiple 
copies of each shard) may be simpler than combining sharding and 
standbys, but I don't see why it makes low-data-loss HA impossible.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Andres Freund
On 2015-09-01 14:11:21 -0400, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
>  wrote:
> > Memory bandwidth, for example. It's quite difficult to spot, because the
> > intuition is that memory is fast, but thanks to improvements in storage (and
> > stagnation in RAM bandwidth), this is becoming a significant issue.
> 
> I'd appreciate any tips on how to spot problems of this type.  But
> it's my impression that perf, top, vmstat, and other Linux performance
> tools will count time spent waiting for memory as CPU time, not idle
> time.  If that's correct, that wouldn't explain workloads where CPU
> utilization doesn't reach 100%.  Rather, it would show up as CPU time
> hitting 100% while tps remains low.

Yea.

-e bus-cycles is a good start to measure where bus traffic is
 relevant. Depending on the individual cpu other events can be helpful.

> > Process-management overhead is another thing we tend to ignore, but once you
> > get to many processes all willing to work at the same time, you need to
> > account for that.
> 
> Any tips on spotting problems in that area?

Not perfect, but -e context-switches (general context switches) and -e
syscalls:sys_enter_semop (for postgres enforced context switches) is
rather useful when combined with --call-graph dwarf ('fp' sometimes
doesn't see through libc which is most of the time not compiled with
-fno-omit-frame-pointer).

Greetings,

Andres Freund




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
 wrote:
> Memory bandwidth, for example. It's quite difficult to spot, because the
> intuition is that memory is fast, but thanks to improvements in storage (and
> stagnation in RAM bandwidth), this is becoming a significant issue.

I'd appreciate any tips on how to spot problems of this type.  But
it's my impression that perf, top, vmstat, and other Linux performance
tools will count time spent waiting for memory as CPU time, not idle
time.  If that's correct, that wouldn't explain workloads where CPU
utilization doesn't reach 100%.  Rather, it would show up as CPU time
hitting 100% while tps remains low.

> Process-management overhead is another thing we tend to ignore, but once you
> get to many processes all willing to work at the same time, you need to
> account for that.

Any tips on spotting problems in that area?

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Tomas Vondra

Hi,

On 09/01/2015 07:17 PM, Robert Haas wrote:

On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus  wrote:

You're assuming that our primary bottleneck for writes is IO. It's
not at present for most users, and it certainly won't be in the
future. You need to move your thinking on systems resources into
the 21st century, instead of solving the resource problems from 15
years ago.


Your experience doesn't match mine.  I find that it's frequently
impossible to get the system to use all of the available CPU
capacity, either because you're bottlenecked on locks or because you
are bottlenecked on the  I/O subsystem, and with the locking
improvements in newer versions, the former is becoming less and less
common. Amit's recent work on scalability demonstrates this trend: he
goes looking for lock bottlenecks, and finds problems that only occur
at 128+ concurrent connections running full tilt.  The patches show
limited benefit - a few percentage points - at lesser concurrency
levels.  Either there are other locking bottlenecks that limit
performance at lower client counts but which mysteriously disappear
as concurrency increases, which I would find surprising, or the limit
is somewhere else.  I haven't seen any convincing evidence that the
I/O subsystem is the bottleneck, but I'm having a hard time figuring
out what else it could be.


Memory bandwidth, for example. It's quite difficult to spot, because the 
intuition is that memory is fast, but thanks to improvements in storage 
(and stagnation in RAM bandwidth), this is becoming a significant issue.


Process-management overhead is another thing we tend to ignore, but once 
you get to many processes all willing to work at the same time, you need 
to account for that.


Of course, this applies differently to different sharding use cases. For 
example, analytics workloads have serious issues with memory bandwidth, 
but not so much with process-management overhead (because the number of 
connections is usually about the number of cores). Use cases with many 
clients (web-scale deployments) tend to run into both (all the processes 
also have to share the same caches, hurting hit rates).


I don't know if sharding can help solving (or at least improve) these 
issues. And if sharding in general can, I don't know if it still holds 
for FDW-based solution.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Josh Berkus
On 09/01/2015 10:17 AM, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus  wrote:
>> Any sharding solution worth bothering with will solve some or all of the
>> above by extending our ability to process requests across multiple
>> nodes.  Any solution which does not is merely an academic curiosity.
> 
> I think the right solution to those problems is to attack them
> head-on.  Sharding solutions should cater to use cases where using all
> the resources of one machine isn't sufficient no matter how
> efficiently we do it.

As long as "all the resources" != "just IO", I'm completely on board
with that.  The reason I raised this is that the initial FDW-based
proposals pretty much scale IO and nothing else.

pg_shard also currently only scales IO, but they're working on that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Joshua D. Drake

On 09/01/2015 10:06 AM, Josh Berkus wrote:

On 09/01/2015 02:39 AM, Bruce Momjian wrote:

On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:



Our real future bottlenecks are:

* ability to handle more than a few hundred connections


This, 1000 times this. No, a connection pooler doesn't help; we are 
using 1000 connections WITH a connection pooler. The number of 
data-centric, always-connected (even if stateless) apps out there is 
astounding. Our performance curve is better (especially since Robert's 
fantastic work on 9.2) but it isn't where a lot of the demanding users 
need it to be.


We are also seeing a vast uptick in always-connected apps that aren't 
web-centric (or at least the client isn't). In other words, we are 
seeing situations where 1000 connections means 1000 users, connected.


JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus  wrote:
> You're assuming that our primary bottleneck for writes is IO.  It's not
> at present for most users, and it certainly won't be in the future.  You
> need to move your thinking on systems resources into the 21st century,
> instead of solving the resource problems from 15 years ago.

Your experience doesn't match mine.  I find that it's frequently
impossible to get the system to use all of the available CPU capacity,
either because you're bottlenecked on locks or because you are
bottlenecked on the  I/O subsystem, and with the locking improvements
in newer versions, the former is becoming less and less common.
Amit's recent work on scalability demonstrates this trend: he goes
looking for lock bottlenecks, and finds problems that only occur at
128+ concurrent connections running full tilt.  The patches show
limited benefit - a few percentage points - at lesser concurrency
levels.  Either there are other locking bottlenecks that limit
performance at lower client counts but which mysteriously disappear as
concurrency increases, which I would find surprising, or the limit is
somewhere else.  I haven't seen any convincing evidence that the I/O
subsystem is the bottleneck, but I'm having a hard time figuring out
what else it could be.

> Our real future bottlenecks are:
>
> * ability to handle more than a few hundred connections
> * locking limits on the scalability of writes
> * ability to manage large RAM and data caches

I do agree that all of those things are problems, FWIW.

> Any sharding solution worth bothering with will solve some or all of the
> above by extending our ability to process requests across multiple
> nodes.  Any solution which does not is merely an academic curiosity.

I think the right solution to those problems is to attack them
head-on.  Sharding solutions should cater to use cases where using all
the resources of one machine isn't sufficient no matter how
efficiently we do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Josh Berkus
On 09/01/2015 02:39 AM, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
>> I'm also going to pontificate that, for a future solution, we should not
>> focus on write *IO*, but rather on CPU and RAM. The reason for this
>> thinking is that, with the latest improvements in hardware and 9.5
>> improvements, it's increasingly rare for machines to be bottlenecked on
>> writes to the transaction log (or the heap). This has some implications
>> for system design.  For example, solutions which require all connections
>> to go through a single master node do not scale sufficiently to be worth
>> bothering with.
> 
> Well, I highlighted write IO for sharding because sharding is the only
> solution that allows write scaling.  If we want to scale CPU, we are
> better off using server parallelism, and to scale CPU and RAM, a
> multi-master/BDR solution seems best.  (Multi-master doesn't do write
> scaling because you eventually have to write all the data to each node.)

You're assuming that our primary bottleneck for writes is IO.  It's not
at present for most users, and it certainly won't be in the future.  You
need to move your thinking on systems resources into the 21st century,
instead of solving the resource problems from 15 years ago.

Currently, CPU resources and locking are the primary bottlenecks on
writing for the vast majority of the hundreds of servers I tune every
year.  This even includes AWS, with EBS's horrible latency; even in that
environment, most users can outstrip PostgreSQL's ability to handle
requests by getting 20K PIOPS (provisioned IOPS).

Our real future bottlenecks are:

* ability to handle more than a few hundred connections
* locking limits on the scalability of writes
* ability to manage large RAM and data caches


The only place where IO becomes the bottleneck is for the
batch-processing, high-throughput DW case ... and I would argue that
existing forks already handle that case.

Any sharding solution worth bothering with will solve some or all of the
above by extending our ability to process requests across multiple
nodes.  Any solution which does not is merely an academic curiosity.

> For these reasons, I think sharding has a limited use, and hence, I
> don't think the community will be willing to add a lot of code just to
> enable auto-sharding.  I think it has to be done in a way that adding
> sharding also gives other benefits, like better FDWs and cross-node ACID
> control.
> 
> In summary, I don't think adding a ton of code just to do sharding will
> be acceptable.  A corollary of that, is that if FDWs are unable to
> provide useful sharding, I don't see an acceptable way of adding
> built-in sharding to Postgres.

So, while I fully agree with you that our sharding tools should have
side benefits, I think you're missing the big picture entirely.  In
a few years, clustered/sharded PostgreSQL will be the default
installation, or we'll be a legacy database.  Single-node and
single-master databases are rapidly becoming history.

From my perspective, we don't need an awkward, limited, bolt-on solution
for write-scaling.  We need something which will become core to how
PostgreSQL works.  I just don't see us getting there with the described
FDW approach, which is why I keep raising issues with it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 7:01 AM, Mason S  wrote:
> For efficient internodes joins with row shipping, FDWs may also not be easy
> to do. Maybe it is possible if we optionally pass in lists of other nodes
> and information about how they are partitioned so data knows where to get
> shipped.
>
> A challenge for planning with arbitrary copies of different shards is that
> sometimes you may be able to push down joins, sometimes not. Planning and
> execution get ugly. Maybe this can be simplified by parent-child tables
> following the same partitioning scheme.

This gets at a problem which Ozgun also mentioned in his Google
document, and which I also discussed with Etsuro Fujita at PGCon: good
query planning requires good metadata, and we don't really have that
today.

I think that a big part of the goal of the declarative partitioning
work that Amit Langote has recently been undertaking is to provide a
catalog representation of the partitioning structure that is easy to
work with, as opposed to just having a bunch of CHECK constraints that
you have to try to reason about.  That's one part of the solution.

You also might need to know more about the remote table than is
captured by the column and data type list.  Most particularly, you
might want to know what indexes exist on the remote side, but
currently, to figure out that out, you'd need to send queries to
retrieve that information to the remote side every time you do
planning, or maybe you could contrive a session-lifespan cache.
That's pretty annoying.

One idea for solving this problem is to allow CREATE INDEX on foreign
tables, but I don't like that much.  There's no guarantee that the
remote side is a PostgreSQL instance, and if it isn't, the relevant
details about the indexes that exist may not be convenient to
represent in our catalogs.  Heck, that can be true even if it is a
PostgreSQL instance, if the remote side relies on an AM or a function
that doesn't exist locally.  But even if both sides are PostgreSQL
instances using only btree indexes on raw columns, now you've put the
burden on the DBA to make sure that the index definitions on the local
and remote sides match, and that's a pain in the neck.

What seems better to me is to allow ANALYZE of a foreign table a place
to record an arbitrary blob of metadata about the remote side that it
can then get access to during planning.  Then, you can record details
about indexes, or statistics that don't fit into the mold of
pg_statistic, or really, anything else you're going to need to figure
out the best plan, and if the DBA changes the configuration on the
remote side, they don't need to update the local configuration to
match; a re-ANALYZE will do the trick.

I'm open to other ideas as well.  Repartitioning will also arise for
an all-local parallel join, and it would be nice if the planner smarts
could be shared between that case and the remote-table case.
Therefore, while I suspect that some of the logic here will end up
inside one or more FDWs, I'm pretty confident that a significant chunk
of it needs to go into the core optimizer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian  wrote:
> I assumed these queries were going to be solved by sending as digested
> data as possible to the coordinator, and having the coordinator complete
> any remaining processing.  I think we are going to need to decide if
> such "sending data back to shards" is something we are ever going to
> implement.  I can see FDWs _not_ working well for that use-case.

I do think we are going to want to support that.  All the people I've
talked to about parallel and distributed query processing agree that
you need to do that sort of thing to get really good and scalable
performance.  I think that we could make a lot of headway as compared
with the status quo just by implementing more pushdown optimizations
than we have today.  Right now, SELECT COUNT(*) FROM table will suck
back the whole remote table and count the rows locally, and that's
stupid.  We can fix that case with better pushdown logic.  We can also
fix the case of N-way join nests where the joins are either on the
partitioning key or to replicated tables.  But suppose you have a join
between two tables which are sharded across the cluster but not on the
partitioning key.  There's no way to push the join down, so all the
work comes back to the coordinator, which is possibly OK if such
queries are rare, but not so hot if they are frequent.
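The COUNT(*) pushdown mentioned above can be sketched like this (a
hedged illustration: run_on_shard() and the shard names are stand-ins
for sending SQL to a remote node, not a real API). Each shard counts its
own rows and the coordinator just sums the partial results, instead of
pulling the whole table back.

```python
# Sketch of aggregate pushdown: per-shard counts summed locally, rather
# than shipping every remote row to the coordinator.
def run_on_shard(shard, sql):
    # Stand-in: pretend each shard returns its local row count.
    counts = {"shard1": 1000, "shard2": 2500, "shard3": 750}
    return counts[shard]

def pushed_down_count(shards, table):
    partials = [run_on_shard(s, f"SELECT count(*) FROM {table}") for s in shards]
    return sum(partials)

assert pushed_down_count(["shard1", "shard2", "shard3"], "t") == 4250
```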

To leverage the full CPU power of the cluster in such a case, you need
to be able to shuffle data around between the nodes.  You pick one of
the two tables being joined, and based on the partitioning key of that
table, each node examines the other table and, for each row, sends it
to the machine where it MIGHT have one or more join partners.  Then
each node can join its shard of the first table against the rows from
the second table that were sent to it.
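The shuffle described above can be sketched as a toy example (hash_node()
stands in for the real partitioning function; the data is made up): rows
of the second table are routed to whichever node owns the matching hash
partition of the first table's join key, after which each node joins
purely locally.

```python
# Toy sketch of a distributed hash-shuffle join: ship each row of table B
# to the node that might hold its join partners in table A, then join
# locally on every node.
def hash_node(key, n_nodes):
    return hash(key) % n_nodes  # stand-in for the partitioning function

def shuffle(rows, key_idx, n_nodes):
    buckets = {n: [] for n in range(n_nodes)}
    for row in rows:
        buckets[hash_node(row[key_idx], n_nodes)].append(row)
    return buckets

N = 3
# Table A is already partitioned on its join key (column 0).
a_parts = shuffle([(k, "a%d" % k) for k in range(6)], 0, N)
# Table B is sharded on some other key, so its rows must be shipped.
b_parts = shuffle([(k, "b%d" % k) for k in range(6)], 0, N)

# Each node joins only its local A partition against the B rows sent to it.
joined = []
for n in range(N):
    a_by_key = {r[0]: r for r in a_parts[n]}
    joined += [(a_by_key[r[0]], r) for r in b_parts[n] if r[0] in a_by_key]

assert len(joined) == 6  # every key meets its partner on exactly one node
```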

Now the question is, where should the code that does all of this live?
 postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult.  If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results.  I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Joshua D. Drake

On 09/01/2015 09:08 AM, Robert Haas wrote:

On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee



From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be.  If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it.  If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them.


I think that is pretty much anybody in the industry's point of view and 
I agree with it. Although there is definitely an argument to be had 
about whether it is a module or it is in core proper, no matter what it 
should be a .Org project.


You are also correct on the replication statement. Replication should 
have never been outside of core and I say that as the company that wrote 
a replication solution that even Fortune 50 companies used.


I think that we should consider that there are very smart, very talented 
engineers already devoting time to their respective communities (XL/XC) 
that may very much enjoy being part of our community so we don't have to 
reinvent the wheel, again.


JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund  wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
>> Uh, we already have a list of things we need to add to FDWs to make them
>> work, and Citus Data has provided a document of more things that are
>> needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
>> you want to confirm that everyone agrees that major FDW improvements are
>> a requirement for this.
>
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
>
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implementation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind of afraid your relatives
> will see what you've done.

I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better.  On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize.  Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.  Those patches have
been around for some time, but progress has been slow.  Core
infrastructure exists to allow sort pushdown, but nobody's done
anything with it.  Aggregate pushdown hasn't happened yet due to the
dependency on upper planner path-ification, but it's not as if some
alternative to the FDW interface is going to dodge that problem.

It would be a bad idea to cling blindly to the FDW infrastructure if
it's fundamentally inadequate to do what we want.  On the other hand,
it would also be a bad idea to set about recreating it without a
really good reason, and - just to take one example - the fact that it
doesn't currently push down DML operations to the remote side is not a
really good reason to rewrite the whole thing.  On the contrary, it's
a reason to put some energy into the already-written patch which
implements that optimization.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Robert Haas
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
 wrote:
> My worry is that if we start implementing them again from scratch, it will
> take a few years before we get them in a usable state. What XC/XL lacked is
> probably a Robert Haas or a Tom Lane who could look at the work and suggest
> major edits. If that had happened, the quality of the product could have
> been much better today. I don't mean to denigrate the developers who worked on
> XC/XL, but there is no harm in accepting that if someone with a much better
> understanding of the whole system was part of the team, that would have
> positively impacted the project. Is that an angle worth exploring? Does it
> make sense to commit some more resources to say XC or XL and try to improve
> the quality of the product even further? To be honest, XL is in far far
> better shape (haven't really tried XC in a while) and some more QA/polishing
> can make it production ready much sooner.

From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be.  If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it.  If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them.  However, I don't think
it's serving anybody very well that we have the XC fork, and multiple
forks of the XC fork, floating around out there and people are working
on those instead of working on core PostgreSQL.  The reality is that
we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
different projects and have all of them be good.  The reality is,
also, that horizontal scalability isn't an optional feature.  There
was a point in time at which the PostgreSQL project's official policy
on replication was that it did not belong in core.  That was a bad
policy; thankfully, it was reversed, and the result was Hot Standby
and Streaming Replication, incredibly important technologies without
which we would not be where we are today. Horizontal scalability is
just as essential.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread David Fetter
On Tue, Sep 01, 2015 at 10:15:27AM +0200, Andres Freund wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > Uh, we already have a list of things we need to add to FDWs to make them
> > work, and Citus Data has provided a document of more things that are
> > needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> > you want to confirm that everyone agrees that major FDW improvements are
> > a requirement for this.
> 
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
> 
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implementation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind of afraid your relatives
> will see what you've done.

Whatever we decide on, we can only count on built-in multi-node being
adopted if all the needed bits ship with every PostgreSQL
installation.  If we require people do Install More Software™ in order
to get a feature, we're going to lose a majority of our potential
base.

If it turns out we need the PostgreSQL FDW, and I believe some
consensus is starting to gel around that, I can see not installing it
in template1 by default.  Even that's a questionable decision, as
merely having the software in place does not credibly increase the
attack surface, while requiring a separate install adds a fallible step
to the installation procedure.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Joshua D. Drake

On 09/01/2015 02:58 AM, Mark Kirkwood wrote:

On 01/09/15 21:41, Bruce Momjian wrote:



I'm thinking that partitioning and sharding are two different things:

Partitioning is about reducing the amount of table data accessed and
also perhaps easing admin activities (archiving/removing old stuff etc).

Sharding is about parallelism and redundancy...copies of stuff in
different places and concurrent access by virtue of it being on
different nodes!


In our world, they are complementary. Consider partitioning that uses 
FDW tables with proper plan push down etc




Now *maybe* FDW is a good way to approach this, but really would be nice
to see a more rigorous analysis (I note that like XC and XL, Greenplum
looked at the existing mechanisms around at the time and ended up
writing their own). Now I'm aware that things have moved on - but I
think there needs to be a proper discussion about design and what we
think distributed data/sharding etc should provide *before* grabbing
hold of FDW as *the answer*!


Agreed.

Sincerely,

Joshua D. Drake



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Joshua D. Drake

On 09/01/2015 02:48 AM, Bruce Momjian wrote:

On Tue, Sep  1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:



There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it was, I
bet we would have considered it long ago.

I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities.


So that we have XL again?

This reads like: We don't want to merge a bunch of code that a non .Org 
project wrote. Instead we want to write a bunch of code that will all 
but duplicate what that other project wrote because ?


Don't get me wrong, I am all about this goal and if XL truly doesn't fit 
that's cool but I think the idea that our project will somehow reinvent 
a better wheel is naive.



(I think we got a little too cute in enhancing existing Postgres
capabilities to add partitioning, and I think we need to fix that.
Hopefully we don't do the same thing with sharding.)


Not cute, lazy. We implemented the barest minimum within our existing 
infrastructure and we have been kicking ourselves ever since.


Sincerely,

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Mason S
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian  wrote:

> On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> >
> > > My hope is that many FDW improvements will benefit sharding and
> > > non-sharding workloads, but I bet some improvements are going to be
> > > sharding-specific.  I would say we are still in the exploratory stage,
> > > but based on the number of people who care about this feature and want
> > > to be involved, I think we are off to a very good start.  :-)
> >
> > Having lots of interested people doesn't help with some problems,
> > though.  The Citus document says:
> >
> >   And the issue with these four limitations wasn't with foreign
> >   data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
> >   quite happy with the contract FDWs provide. The problem was that
> >   we were trying to retrofit an API for something that it was
> >   fundamentally not designed to do.
>
> I had a chance to review the Citus Data document just now:
>
> https://goo.gl/vJWF85
>
> Particularly, it links to this document, which is clearer about the
> issues they are trying to solve:
>
>
> https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
>
> The document opens a big question --- when queries can't be processed in
> a traditional top/down fashion, Citus has the goal of sending groups of
> results up to the coordinator, reordering them, then sending them back
> to the shards for further processing, basically using the shards as
> compute engines because the shards are no longer using local data to do
> their computations.  The two examples they give are COUNT(DISTINCT) and
> a join across two sharded tables ("CANADA").
>
> I assumed these queries were going to be solved by sending as digested
> data as possible to the coordinator, and having the coordinator complete
> any remaining processing.  I think we are going to need to decide if
> such "sending data back to shards" is something we are ever going to
> implement.  I can see FDWs _not_ working well for that use-case.
>
>
For efficient inter-node joins with row shipping, FDWs may also not be easy
to do. Maybe it is possible if we optionally pass in lists of other nodes
and information about how they are partitioned so data knows where to get
shipped.

A challenge for planning with arbitrary copies of different shards is that
sometimes you may be able to push down joins, sometimes not. Planning and
execution get ugly. Maybe this can be simplified by parent-child tables
following the same partitioning scheme.
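The co-location idea in that last sentence can be sketched with a toy model (plain Python, not Postgres or FDW code; the tables and keys are invented): when two tables are distributed by the same partitioning function on the join key, each shard-pair join can run locally and the union of the local joins equals the global join, with no row shipping.

```python
# Toy co-location model: distribute two tables with the SAME
# partitioning function, join shard-by-shard, and check that the
# union of local joins equals the global join.
NUM_SHARDS = 4

def part(key):
    # Shared distribution function for both tables.
    return hash(key) % NUM_SHARDS

orders = [("alice", 10), ("bob", 20), ("alice", 30)]
users = [("alice", "US"), ("bob", "CA")]

# Place each row on the shard chosen by its distribution key.
order_shards = [[r for r in orders if part(r[0]) == s] for s in range(NUM_SHARDS)]
user_shards = [[r for r in users if part(r[0]) == s] for s in range(NUM_SHARDS)]

# Join each shard pair locally; matching rows are guaranteed to be
# co-located because part() depends only on the join key.
local = [(o[0], o[1], u[1])
         for s in range(NUM_SHARDS)
         for o in order_shards[s]
         for u in user_shards[s]
         if o[0] == u[0]]

globl = [(o[0], o[1], u[1]) for o in orders for u in users if o[0] == u[0]]
assert sorted(local) == sorted(globl)
```

If the two tables used different partitioning schemes, the per-shard joins would miss matches, which is exactly why planning gets ugly with arbitrary shard placement.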

Mason


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Bruce Momjian
On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > My hope is that many FDW improvements will benefit sharding and
> > non-sharding workloads, but I bet some improvements are going to be
> > sharding-specific.  I would say we are still in the exploratory stage,
> > but based on the number of people who care about this feature and want
> > to be involved, I think we are off to a very good start.  :-)
> 
> Having lots of interested people doesn't help with some problems,
> though.  The Citus document says:
> 
>   And the issue with these four limitations wasn't with foreign
>   data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
>   quite happy with the contract FDWs provide. The problem was that
>   we were trying to retrofit an API for something that it was
>   fundamentally not designed to do.

I had a chance to review the Citus Data document just now:

https://goo.gl/vJWF85

Particularly, it links to this document, which is clearer about the
issues they are trying to solve:


https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

The document opens a big question --- when queries can't be processed in
a traditional top/down fashion, Citus has the goal of sending groups of
results up to the coordinator, reordering them, then sending them back
to the shards for further processing, basically using the shards as
compute engines because the shards are no longer using local data to do
their computations.  The two examples they give are COUNT(DISTINCT) and
a join across two sharded tables ("CANADA").

I assumed these queries were going to be solved by sending as digested
data as possible to the coordinator, and having the coordinator complete
any remaining processing.  I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement.  I can see FDWs _not_ working well for that use-case.
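The COUNT(DISTINCT) case can be illustrated with a toy model (plain Python, not Citus or Postgres code; the shard contents are invented): per-shard distinct counts cannot simply be summed at the coordinator, which is why a repartition step that ships rows back out to the shards is needed.

```python
# Why COUNT(DISTINCT) needs repartitioning: the same value may live
# on several shards, so summing per-shard distinct counts overcounts.
shards = [
    [1, 2, 3],
    [3, 4],
    [4, 5, 1],
]

# Naive coordinator-side merge of per-shard distinct counts.
naive = sum(len(set(s)) for s in shards)  # 3 + 2 + 3 = 8, wrong

# Repartition: reshuffle every value to a "compute shard" chosen by
# hash, so each distinct value lands on exactly one node; the per-node
# distinct sets are then disjoint and their sizes can be summed.
NUM_NODES = 2
repartitioned = [set() for _ in range(NUM_NODES)]
for shard in shards:
    for value in shard:
        repartitioned[hash(value) % NUM_NODES].add(value)

exact = sum(len(node) for node in repartitioned)  # 5 distinct values

print(naive, exact)
```

The expensive part in a real system is the middle step: every shard must ship rows to every compute node, which is the Map/Reduce-style shuffle that the current FDW API has no way to express.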

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Mason S
On Mon, Aug 31, 2015 at 3:08 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Mon, Aug 31, 2015 at 9:48 PM, Mason S  wrote:
>
>>
>>>  We also a bit disappointed by Huawei position about CSN patch, we hoped
>>> to use for  our XTM.
>>>
>>
>> Disappointed in what way? Moving to some sort of CSN approach seems to
>> open things up for different future ideas. In the short term, it would mean
>> replacing potentially large snapshots and longer visibility checks. In the
>> long term, perhaps CSN could help simplify the design of multi-master
>> replication schemes.
>>
>
> We are disappointed because at PGCon talk Huawei announced publishing of
> their CSN patch and further work in this direction together with community.
> However, it's even not published yet despite all the promises. Nobody from
> Huawei answers CSN thread in the hackers.
> So, I think we got nothing from Huawei except teasing and should rely only
> on ourselves. That is disappointing.
>
>
Oh, I see. I was in contact with some of the folks involved a few weeks
ago.  The intention is to do a bit more work on the patch and then post it.
This should be happening "soon".

Thanks,

Mason


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Mark Kirkwood

On 01/09/15 21:41, Bruce Momjian wrote:


Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another.  ;-)

See my post to Josh Berkus just now --- I think if we don't use FDWs,
that sharding is such a limited use-case that we will not implement it
inside of Postgres.



I'm thinking that partitioning and sharding are two different things:

Partitioning is about reducing the amount of table data accessed and 
also perhaps easing admin activities (archiving/removing old stuff etc).


Sharding is about parallelism and redundancy...copies of stuff in 
different places and concurrent access by virtue of it being on 
different nodes!


Now *maybe* FDW is a good way to approach this, but really would be nice 
to see a more rigorous analysis (I note that like XC and XL, Greenplum 
looked at the existing mechanisms around at the time and ended up 
writing their own). Now I'm aware that things have moved on - but I 
think there needs to be a proper discussion about design and what we 
think distributed data/sharding etc should provide *before* grabbing 
hold of FDW as *the answer*!


Regards

Mark




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
> My worry is that if we start implementing them again from scratch, it will 
> take
> a few years before we get them in a usable state. What XC/XL lacked is 
> probably
> a Robert Haas or a Tom Lane who could look at the work and suggest major 
> edits.
> If that had happened, the quality of the product could have been much better
> today. I don't mean to denigrate the developers who worked on XC/XL, but there is
> no harm in accepting that if someone with a much better understanding of the
> whole system was part of the team, that would have positively impacted the
> project. Is that an angle worth exploring? Does it make sense to commit some
> more resources to say XC or XL and try to improve the quality of the product
> even further? To be honest, XL is in far far better shape (haven't really 
> tried
> XC in a while) and some more QA/polishing can make it production ready much
> sooner.

There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it was, I
bet we would have considered it long ago.

I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities.  

(I think we got a little too cute in enhancing existing Postgres
capabilities to add partitioning, and I think we need to fix that.
Hopefully we don't do the same thing with sharding.)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2015 at 10:15:27AM +0200, Andres Freund wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > Uh, we already have a list of things we need to add to FDWs to make them
> > work, and Citus Data has provided a document of more things that are
> > needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> > you want to confirm that everyone agrees that major FDW improvements are
> > a requirement for this.
> 
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
> 
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implementation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind of afraid your relatives
> will see what you've done.

Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another.  ;-)

See my post to Josh Berkus just now --- I think if we don't use FDWs,
that sharding is such a limited use-case that we will not implement it
inside of Postgres.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Bruce Momjian
On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
> I'm also going to pontificate that, for a future solution, we should not
> focus on write *IO*, but rather on CPU and RAM. The reason for this
> thinking is that, with the latest improvements in hardware and 9.5
> improvements, it's increasingly rare for machines to be bottlenecked on
> writes to the transaction log (or the heap). This has some implications
> for system design.  For example, solutions which require all connections
> to go through a single master node do not scale sufficiently to be worth
> bothering with.

Well, I highlighted write IO for sharding because sharding is the only
solution that allows write scaling.  If we want to scale CPU, we are
better off using server parallelism, and to scale CPU and RAM, a
multi-master/BDR solution seems best.  (Multi-master doesn't do write
scaling because you eventually have to write all the data to each node.)

For read-scaling, having the read queries go to a streaming slave seems
best.

Only sharding allows for write scaling, but only in very limited
use-cases, where the queries use multiple shards and the
sharding/communication overhead doesn't overwhelm the benefits.

For these reasons, I think sharding has a limited use, and hence, I
don't think the community will be willing to add a lot of code just to
enable auto-sharding.  I think it has to be done in a way that adding
sharding also gives other benefits, like better FDWs and cross-node ACID
control.

In summary, I don't think adding a ton of code just to do sharding will
be acceptable.  A corollary of that, is that if FDWs are unable to
provide useful sharding, I don't see an acceptable way of adding
built-in sharding to Postgres.
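The write-scaling property described above comes from deterministic routing: each row's distribution key maps to exactly one shard, so independent writes land on different nodes and proceed in parallel instead of funneling through one master's WAL. A minimal sketch (plain Python, not Postgres code; the node names and keys are invented):

```python
# Hash-based shard routing: the mechanism behind sharded write scaling.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]

def shard_for(key):
    # A stable hash (not Python's randomized hash()) so every
    # coordinator routes the same key to the same node.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Writes for distinct keys fan out across the nodes.
placements = {k: shard_for(k) for k in ("user:1", "user:2", "user:3")}

# Routing is deterministic: re-deriving the shard gives the same answer.
assert all(shard_for(k) == v for k, v in placements.items())
```

The caveat in the paragraph above shows up as soon as a statement touches keys on several nodes: the coordinator then pays cross-node communication and transaction overhead, which can overwhelm the parallelism benefit.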

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-01 Thread Andres Freund
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> Uh, we already have a list of things we need to add to FDWs to make them
> work, and Citus Data has provided a document of more things that are
> needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> you want to confirm that everyone agrees that major FDW improvements are
> a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.

I'm not sure myself. If it works out it's going to save us some work and
make it more realistic to get there sometime not too far off. But I'm
afraid that the resulting system will feel like our current partitioning
implementation. Yes, it kinda works, but it's hard to get started, it
doesn't support too many features and you're kind of afraid your relatives
will see what you've done.




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Pavan Deolasee
On Tue, Sep 1, 2015 at 3:17 AM, Robert Haas  wrote:

>
>
> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries.   As far as (1) is concerned, we need declarative
> partitioning, which is being worked on by Amit Langote.  As far as (2)
> is concerned, I hope and expect BDR, or technology derived therefrom,
> to eventually fill that need.  As far as (3) is concerned, why
> wouldn't we use the foreign data wrapper interface, and specifically
> postgres_fdw?  That interface was designed for the explicit purpose of
> allowing access to remote data sources, and a lot of work has been put
> into it, so it would be highly surprising if we decided to throw that
> away and develop something completely new from the ground up.
>
> It's true that postgres_fdw doesn't do everything we need yet.  The
> new join pushdown hooks aren't used by postgres_fdw yet, and the API
> itself has some bugs with EvalPlanQual handling.  Aggregate pushdown
> is waiting on upper planner path-ification.   DML pushdown doesn't
> exist yet, and the hooks that would enable pushdown of ORDER BY
> clauses to the remote side aren't being used by postgres_fdw.  But all
> of these things have been worked on.  Patches for many of them have
> already been posted.  They have suffered from a certain amount of
> neglect by senior hackers, and perhaps also from a shortage of time on
> the part of the authors.  But an awful lot of the work that is needed
> here has already been done, if only we could get it committed.
> Aggregate pushdown is a notable exception, but abandoning the foreign
> data wrapper approach in favor of something else won't fix that.
>
> Postgres-XC developed a purpose-built system for talking to other
> nodes instead of using the FDW interface, for the very good reason
> that the FDW interface did not yet exist at the time that Postgres-XC
> was created.  But several people associated with the XC project have
> said, including one on this thread, that if it had existed, they
> probably would have used it.  And it's hard to see why you wouldn't:
> with XC's approach, the remote data source is presumed to be
> PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
> facility as part of a sharding solution.  The FDW interface can talk
> to anything, and it can be used for stuff other than sharding, like
> making one remote table appear local because you just happen to want
> that for some reason.  This makes the XC approach look rather brittle
> by comparison.  I don't blame the XC folks for taking the shortest
> path between two points, but FDWs are better, and we ought to try to
> leverage that.
>
>
In my discussions on this topic with various folks including Robert, I've
conceded that if FDW was available when XC was first written, in all
likelihood we would have used and extended that interface. But that wasn't
the case and we did what we thought was the best solution at that time,
given the resources and the schedule. To be honest, when XC project was
started, I was quite skeptical about the whole thing given the goal was to
build something which could replace Oracle RAC with maybe less than 1% of
the resources Oracle must have invested in building RAC. The lack of
resources at the start of the project keeps showing up in the quality
issues that users report from time to time. Having said that, I am quite
satisfied with what we have been able to build within the constraints.

But FDW is just one part of the story. There is this entire global
consistency problem that would require something like GTM to give out XIDs
and snapshots, atomicity which would require managing transactions across
multiple shards, join pushdowns when all data is not available locally,
something that XL is attempting to solve with datanode-datanode exchange of
information, other global states such as sequences, replicating some part
of the data to multiple shards for efficient operations, ability to
add/remove shards with least disruption, globally consistent
backups/restore. XC/XL has attempted to solve each of them to some extent.
I don't claim that they are completely solved and there are no corner cases
left, but we have made fairly good progress on each of them.

My worry is that if we start implementing them again from scratch, it will
take a few years before we get them in a usable state. What XC/XL lacked is
probably a Robert Haas or a Tom Lane who could look at the work and suggest
major edits. If that had happened, the quality of the product could have
been much better today. I don't mean to denigrate the developers who worked on
XC/XL, but there is no harm in accepting that if someone with a much better
understanding of the whole system was part of the team, that would have
positively impacted the project. Is that an angle worth exploring? Does it
make sense to commit some more resources to say XC or XL and try to improve
the quality of the product even further? To be honest, XL is in far far
better shape (haven't really tried XC in a while) and some more QA/polishing
can make it production ready much sooner.

Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Etsuro Fujita

On 2015/09/01 9:54, Bruce Momjian wrote:

On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:

As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw?  That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.


Well, query hooks are also a capability which we already have, and is
mature.  Citus has already posted about why they chose to use them instead.

As long as you recognize that the FDW API (not just the existing fdws)
will need to expand to make this work, it's a viable path.


Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85.


My top priority is postgres_fdw join pushdown, but I also plan to work 
on update pushdown [1] for 9.6, which couldn't make it into 9.5.  I 
think that would resolve the following issue mentioned in the document:


UPDATE and DELETE operations are performed by first fetching
records from the table scanning functions, and then going over
the fetched records. If the user wanted to update a single row,
this involved first pulling rows and then updating related
records.
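The round-trip cost of that fetch-then-update pattern can be modeled with a toy sketch (plain Python, not FDW code; the table size and predicate are invented): the current path pays one round trip per matching row, while a pushed-down UPDATE ships a single statement to the remote side.

```python
# Toy round-trip model for remote UPDATE: fetch-then-update vs pushdown.
rows = list(range(1000))

def fetch_then_update(pred):
    # One round trip to scan for matching rows, then one UPDATE
    # round trip per matched row.
    trips = 1
    matches = [r for r in rows if pred(r)]
    trips += len(matches)
    return trips

def pushdown_update(pred):
    # A single "UPDATE ... WHERE ..." statement shipped remotely.
    return 1

# Updating 10% of a 1000-row table: 101 round trips vs 1.
assert fetch_then_update(lambda r: r % 10 == 0) == 101
assert pushdown_update(lambda r: r % 10 == 0) == 1
```

This is of course only the network-trip dimension; the pushdown patch also avoids pulling row contents through the local executor at all.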

Best regards,
Etsuro Fujita

[1] https://commitfest.postgresql.org/4/162/




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Alvaro Herrera
Bruce Momjian wrote:

> My hope is that many FDW improvements will benefit sharding and
> non-sharding workloads, but I bet some improvements are going to be
> sharding-specific.  I would say we are still in the exploratory stage,
> but based on the number of people who care about this feature and want
> to be involved, I think we are off to a very good start.  :-)

Having lots of interested people doesn't help with some problems,
though.  The Citus document says:

And the issue with these four limitations wasn't with foreign
data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
quite happy with the contract FDWs provide. The problem was that
we were trying to retrofit an API for something that it was
fundamentally not designed to do.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Bruce Momjian
On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
> > As far as (3) is concerned, why
> > wouldn't we use the foreign data wrapper interface, and specifically
> > postgres_fdw?  That interface was designed for the explicit purpose of
> > allowing access to remote data sources, and a lot of work has been put
> > into it, so it would be highly surprising if we decided to throw that
> > away and develop something completely new from the ground up.
> 
> Well, query hooks are also a capability which we already have, and is
> mature.  Citus has already posted about why they chose to use them instead.
> 
> As long as you recognize that the FDW API (not just the existing fdws)
> will need to expand to make this work, it's a viable path.

Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85.  I am not sure how much clearer a signal
you need that everyone agrees major FDW improvements are a requirement
for this.

My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific.  I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start.  :-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Marc Munro
On Mon, 2015-08-31 at 22:21, Robert Haas wrote:

> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries. [...]

I believe there is another aspect to sharding that I have not yet seen
mentioned, which is one of connection routing.  

One use case that I have been involved in, is to simply partition the
application into entirely, or almost entirely, separate datasets running
on separate databases with little or no need for queries to access
remote data.

This allows each database to deal only with connections from clients
that actually want its local data, greatly reducing the number of
connections on any individual database.  If this works for your
application, your ability to scale is great.

The pain point comes from trying to route queries to the correct
database.  Inevitably, everyone taking this route builds custom
connection-selection layers into their apps.

It seems to me that even with the more sophisticated types of sharding
being discussed here, the ability to conditionally route a
query/connection to a suitable starting database could be quite
beneficial.
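A minimal sketch of such a connection-selection layer follows; the node DSNs and the hash scheme are illustrative assumptions, not a proposal for pgbouncer/pgpool:

```python
import hashlib

# Hypothetical backing databases, each holding a disjoint partition of the
# application's data.
NODES = [
    "dbname=app host=shard0",
    "dbname=app host=shard1",
    "dbname=app host=shard2",
]

def dsn_for_key(shard_key: str) -> str:
    # Deterministically map a shard key to the one database that holds its
    # data, so each database only sees connections from clients that want
    # its local rows.
    h = int(hashlib.sha256(shard_key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]
```

This is exactly the kind of custom connection-selection layer that, per the point above, everyone currently rebuilds inside their own application.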

Although this is probably a job for the pgbouncer/pgpool developers
rather than the hackers on this list, this thread seems to be a good
place to mention it.

__
Marc






Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Josh Berkus
On 08/31/2015 02:47 PM, Robert Haas wrote:
> On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus  wrote:
>> First, let me put out there that I think a horizontal scaling project
>> which has buy-in from the community and which we're working on is
>> infinitely better than one we're not working on, or an under-resourced fork.
>> So we're in agreement on that.  However, I think there's a lot of room
>> for discussion; I feel like the FDW approach was decided in exclusive
>> meetings involving a very small number of people.  The FDW approach
>> *may* be the right approach, but I'd like to see some rigorous
>> questioning of that before it's final.
> 
> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries.   As far as (1) is concerned, we need declarative
> partitioning, which is being worked on by Amit Langote.  As far as (2)
> is concerned, I hope and expect BDR, or technology derived therefrom,
> to eventually fill that need.  

Well, maybe.  If you look at pg_shard, you'll see that it works by
multiplexing writes to all copies.  There's a good reason to do that; it
allows you to have a tight feedback loop between the success of writes
and the availability of "good" nodes.  If you're depending on a separate
replication system to handle getting row copies from one shard to
another, then you need a different way to deal with bad nodes and with
inconsistency between copies of shards.  That's why the existing
multinode non-relational databases don't separate replication from
writes, either.
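The write-multiplexing-with-feedback idea can be sketched like this; the names and health states are illustrative, not pg_shard's actual internals:

```python
# Multiplex each write to every healthy replica of a shard, and demote any
# replica whose write fails -- the "tight feedback loop" between write
# success and node availability described above.

def multiplex_write(replicas, execute):
    """replicas: dict node -> 'good' or 'bad'; execute(node) raises on failure."""
    acked = 0
    for node, state in replicas.items():
        if state != "good":
            continue                  # never write to a known-bad node
        try:
            execute(node)
            acked += 1
        except Exception:
            replicas[node] = "bad"    # a failed write immediately marks the node
    if acked == 0:
        raise RuntimeError("write failed on all healthy replicas")
    return acked
```

With a separate replication system, by contrast, the sharding layer learns about bad nodes and divergent shard copies only indirectly, which is the design problem raised above.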

For that matter, if what you want is transactional fully ACID sharding,
I really don't see a way to do it via BDR, since BDR is purely
asynchronous replication, as far as I know.

Also, if we want BDR to do this, that's pretty far afield of what BDR is
currently capable of, so someone will need to put serious work into it
rather than just assuming functionality will show up.

> As far as (3) is concerned, why
> wouldn't we use the foreign data wrapper interface, and specifically
> postgres_fdw?  That interface was designed for the explicit purpose of
> allowing access to remote data sources, and a lot of work has been put
> into it, so it would be highly surprising if we decided to throw that
> away and develop something completely new from the ground up.

Well, query hooks are also a capability which we already have, and is
mature.  Citus has already posted about why they chose to use them instead.

As long as you recognize that the FDW API (not just the existing fdws)
will need to expand to make this work, it's a viable path.

Also consider that (3) includes both reads and writes.

> I think it's abundantly clear that we need a logical replication
> solution as part of any horizontal scalability story.  People will
> want to do things like have 10 machines with each piece of data on 3
> of them, and there won't be any reasonable way of doing that without
> logical replication.  I assume that BDR, or some technology derived
> from it, will end up in core and solve that problem.  I had actually
> hoped we were going to get that in 9.5, but it didn't happen that way.
> Still, I think that getting first single-master, and then eventually
> multi-master, logical replication in core is absolutely critical.  And
> not just for sharding specifically: replicating your whole database to
> several nodes and load-balancing your clients across them isn't
> sharding, but it does give you read scalability and is a good fit for
> people with geographically dispersed data with good geographical
> locality.  I think a lot of people will want that.

Well, the latter thing is something which BDR is designed for, so all
that needs to happen with that is getting the rest of the plumbing into
core.  Also documentation, packaging, productization, etc.  But the
heavy lifting has already been done.

However, integrating BDR with sharding has some major design issues
which aren't trivial and may be unresolvable, per above.

> I'm not quite sure yet how we can marry declarative partitioning and
> better FDW-pushdown and logical replication into one seamless, easy to
> deploy solution that requires very low administrator effort.  But I am
> sure that each of those things, taken individually, is very useful,
> and that being able to construct a solution from those building blocks
> would be a big improvement over what we have today.  I can't imagine
> that trying to do one monolithic project that provides all of those
> things, but only if you combine them in the specific way that the
> designer had in mind, is ever going to be successful.  People _will_
> want access to each of those features in an unbundled fashion.  And,
> trying to do them altogether leads to trying to solve too many
> problems at once.  I think the history of Postgres-XC is a cautionary
> tale there.

Yes.  It's 

Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Sumedh Pathak
Hi Bruce,

Sumedh from Citus Data here.

> August, 2015:  While speaking at SFPUG, Citus Data approached me about joining
the FDW sharding team.  They have been invited to the September 1 meeting,
as have the XC and XL people.

I'd like to add a clarification. We already tried the FDW APIs for pg_shard
two years ago and failed. We figured sharing our learnings could contribute
to the technical discussion and that's why we wanted to be in the call.

Ozgun summarized our technical learnings in this design document:
https://goo.gl/vJWF85

In the document, we focused on one of the four learnings we had with the
FDW APIs. We switched to the hook-API-based approach, and things went
smoothly from there.

Best,
Sumedh

On Sat, Aug 29, 2015 at 7:17 PM, Bruce Momjian  wrote:

> I have recently increased my public statements about the idea of adding
> horizontal scaling/sharding to Postgres. I wanted to share with hackers
> a timeline of how we got here, and where I think we are going in the
> short term:
>
> 2012-2013:  As part of writing my scaling talk
> (http://momjian.us/main/presentations/overview.html#scaling), studying
> Oracle RAC, and talking to users, it became clear that an XC-like
> architecture (sharding) was the only architecture that was going to allow
> for write scaling.
>
> Users and conference attendees I talked to were increasingly concerned
> about the ability of Postgres to scale for high write volumes.  They didn't
> necessarily need that scale now, but they needed to know they could get
> it if they wanted it, and wouldn't need to switch to a new database in
> the future.  This is similar to wanting a car that can get you on a highway
> on-ramp fast --- even if you don't need it, you want to know it is there.
>
> 2014:  I started to shop around the idea that we could use FDWs,
> parallelism, and a transaction/snapshot manager to get XC features
> as built-in to Postgres.  (I don't remember where the original idea
> came from.)  It was clear that having separate forks of the source code
> in XC and XL was never going to achieve critical mass --- there just
> aren't enough people who need high write scale right now, and the fork
> maintenance overhead is a huge burden.
>
> I realized that we would never get community acceptance to dump the XC
> (or XL) code needed for sharding into community Postgres, but with FDWs,
> we could add the features as _part_ of improving FDWs, which would benefit
> FDWs _and_ would be useful for sharding.  (We already see some of those
> FDW features in 9.5.)
>
> October, 2014:  EDB and NTT started working together in the community
> to start improving FDWs as a basis for an FDW-based sharding solution.
> Many of the 9.5 FDW improvements that also benefit sharding were developed
> by a combined EDB/NTT team.  The features improved FDWs independent of
> sharding, so they didn't need community buy-in on sharding to get them
> accepted.
>
> June, 2015:  I attended the PGCon sharding unconference session and
> there was a huge discussion about where we should go with sharding.
> I think the big take-away was that most people liked the FDW approach,
> but had business/customer reasons for wanting to work on XC or XL because
> those would be production-ready faster.
>
> July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
> started to think about joining the FDW approach, rather than working on
> XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
> meeting is scheduled for September 1.
>
> August, 2015:  While speaking at SFPUG, Citus Data approached me about
> joining the FDW sharding team.  They have been invited to the September
> 1 meeting, as have the XC and XL people.
>
> October, 2015:  EDB is sponsoring a free 3-hour summit about FDW sharding
> at the PG-EU conference in Vienna.   Everyone is invited, but it is hoped
> most of the September 1 folks can attend.
>
> February, 2016:  Oleg is planning a similar meeting at their February
> Moscow conference.
>
> Anyway, I wanted to explain the work that has been happening around
> sharding.  As things move forward, I am increasingly convinced that write
> scaling will be needed soon, that the XC approach is the only reasonable
> way to do it, and that FDWs are the cleanest way to get it into community
> Postgres.
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
>   + Everyone has their own god. +
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
*Sumedh Pathak*
Citus Data
650.422.9797
sum...@citusdata.com


Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Robert Haas
On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus  wrote:
> First, let me put out there that I think a horizontal scaling project
> which has buy-in from the community and which we're working on is
> infinitely better than one we're not working on, or an under-resourced fork.
> So we're in agreement on that.  However, I think there's a lot of room
> for discussion; I feel like the FDW approach was decided in exclusive
> meetings involving a very small number of people.  The FDW approach
> *may* be the right approach, but I'd like to see some rigorous
> questioning of that before it's final.

It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries.   As far as (1) is concerned, we need declarative
partitioning, which is being worked on by Amit Langote.  As far as (2)
is concerned, I hope and expect BDR, or technology derived therefrom,
to eventually fill that need.  As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw?  That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.

It's true that postgres_fdw doesn't do everything we need yet.  The
new join pushdown hooks aren't used by postgres_fdw yet, and the API
itself has some bugs with EvalPlanQual handling.  Aggregate pushdown
is waiting on upper planner path-ification.   DML pushdown doesn't
exist yet, and the hooks that would enable pushdown of ORDER BY
clauses to the remote side aren't being used by postgres_fdw.  But all
of these things have been worked on.  Patches for many of them have
already been posted.  They have suffered from a certain amount of
neglect by senior hackers, and perhaps also from a shortage of time on
the part of the authors.  But an awful lot of the work that is needed
here has already been done, if only we could get it committed.
Aggregate pushdown is a notable exception, but abandoning the foreign
data wrapper approach in favor of something else won't fix that.

Postgres-XC developed a purpose-built system for talking to other
nodes instead of using the FDW interface, for the very good reason
that the FDW interface did not yet exist at the time that Postgres-XC
was created.  But several people associated with the XC project have
said, including one on this thread, that if it had existed, they
probably would have used it.  And it's hard to see why you wouldn't:
with XC's approach, the remote data source is presumed to be
PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
facility as part of a sharding solution.  The FDW interface can talk
to anything, and it can be used for stuff other than sharding, like
making one remote table appear local because you just happen to want
that for some reason.  This makes the XC approach look rather brittle
by comparison.  I don't blame the XC folks for taking the shortest
path between two points, but FDWs are better, and we ought to try to
leverage that.

> Particularly, I'm concerned that we already have two projects in process
> aimed at horizontal scalability, and it seems like we could bring either
> (or both) projects to production quality MUCH faster than we could make
> an FDW-based solution work.  These are:
>
> * pg_shard
> * BDR
>
> It seems worthwhile, just as a thought experiment, if we can get where
> we want using those, faster, or by combining those with new FDW features.

I think it's abundantly clear that we need a logical replication
solution as part of any horizontal scalability story.  People will
want to do things like have 10 machines with each piece of data on 3
of them, and there won't be any reasonable way of doing that without
logical replication.  I assume that BDR, or some technology derived
from it, will end up in core and solve that problem.  I had actually
hoped we were going to get that in 9.5, but it didn't happen that way.
Still, I think that getting first single-master, and then eventually
multi-master, logical replication in core is absolutely critical.  And
not just for sharding specifically: replicating your whole database to
several nodes and load-balancing your clients across them isn't
sharding, but it does give you read scalability and is a good fit for
people with geographically dispersed data with good geographical
locality.  I think a lot of people will want that.
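The "10 machines with each piece of data on 3 of them" placement can be sketched as follows; the hash-plus-successors scheme is an assumed example, not a proposal:

```python
import hashlib

def placement(shard_id: int, nodes: list, replicas: int = 3) -> list:
    # Pick a starting node deterministically from the shard id, then take
    # the next replicas-1 nodes around the ring, so each shard lands on
    # `replicas` distinct machines out of len(nodes).
    start = int(hashlib.sha256(str(shard_id).encode()).hexdigest(),
                16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]
```

Deciding *where* copies live is the easy part; keeping the three copies consistent is where the logical replication discussed above comes in.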

I'm not quite sure yet how we can marry declarative partitioning and
better FDW-pushdown and logical replication into one seamless, easy to
deploy solution that requires very low administrator effort.  But I am
sure that each of those things, taken individually, is very useful,
and that being able to construct a solution from those building blocks
would

Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Joshua D. Drake

On 08/31/2015 01:16 PM, Josh Berkus wrote:

All, Bruce:




I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design.  For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.


We see this already: under very high concurrency (lots of connections,
many cores) we often see a significant drop in performance that is not
related to IO in any meaningful way.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Josh Berkus
All, Bruce:

First, let me put out there that I think a horizontal scaling project
which has buy-in from the community and which we're working on is
infinitely better than one we're not working on, or an under-resourced fork.
So we're in agreement on that.  However, I think there's a lot of room
for discussion; I feel like the FDW approach was decided in exclusive
meetings involving a very small number of people.  The FDW approach
*may* be the right approach, but I'd like to see some rigorous
questioning of that before it's final.

Particularly, I'm concerned that we already have two projects in process
aimed at horizontal scalability, and it seems like we could bring either
(or both) projects to production quality MUCH faster than we could make
an FDW-based solution work.  These are:

* pg_shard
* BDR

It seems worthwhile, just as a thought experiment, if we can get where
we want using those, faster, or by combining those with new FDW features.

It's also important to recognize that there are three major use-cases
for write-scalable clustering:

* OLTP: small-medium cluster, absolute ACID consistency,
  bottlenecked on small writes per second
* DW: small-large cluster, ACID optional,
  bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional,
  bottlenecked on # of connections

We cannot possibly solve all of the above at once, but to the extent
that we recognize all 3 use cases, we can build core features which can
be adapted to all of them.

I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design.  For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.

On some other questions from Mason:

> Do we want multiple copies of shards, like the pg_shard approach? Or
> keep things simpler and leave it up to the DBA to add standbys? 

We want multiple copies of shards created by the sharding system itself.
 Having a separate, and completely orthogonal, redundancy system to the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.

> Do we want to leverage table inheritance? If so, we may want to spend
> time improving performance for when the number of shards becomes large
> with what currently exists. If using table inheritance, we could add the
> ability to specify what node (er, foreign server) the subtable lives on.
> We could create top level sharding expressions that allow these to be
> implicitly created.

IMHO, given that we're looking at replacing inheritance because of its
many documented limitations, building sharding on top of inheritance
seems unwise.  For example, many sharding systems are hash-based; how
would an inheritance system transparently use hash keys?

> Should we allow arbitrary expressions for shards, not just range, list
> and hash?

That seems like a 2.0 feature.  It also doesn't seem necessary to
support it for the moderately skilled user; that is, requiring a special
C sharding function for this seems fine to me.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Qingqing Zhou
On Mon, Aug 31, 2015 at 2:12 AM, Oleg Bartunov  wrote:
>
> AFAIK, XC/XL has already some customers and that is an additional pressure
> on their development team, which is now called X2. I don't exactly know how
> internal Huawei's MPPDB is connected to XC/XL.
>

Huawei's MPPDB is based on PG-XC, tailored more toward OLAP scenarios.

The basic idea is that OLAP needs a shared nothing scale out
architecture for read and write. It needs ok-TP-performance, a
restricted set of functionality, and thus avoids some problems like
GTM being a central scaling bottleneck.

I advocate merging scale-out features into PostgreSQL core, provided we
are ready to accept some long-lived functional discrepancies between the
two deployments.

Regards,
Qingqing




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Alexander Korotkov
On Mon, Aug 31, 2015 at 9:48 PM, Mason S  wrote:

>
>> We are also a bit disappointed by Huawei's position on the CSN patch,
>> which we hoped to use for our XTM.
>>
>
> Disappointed in what way? Moving to some sort of CSN approach seems to
> open things up for different future ideas. In the short term, it would mean
> replacing potentially large snapshots and longer visibility checks. In the
> long term, perhaps CSN could help simplify the design of multi-master
> replication schemes.
>

We are disappointed because at PGCon talk Huawei announced publishing of
their CSN patch and further work in this direction together with community.
However, it has not even been published yet, despite all the promises.
Nobody from Huawei answers the CSN thread on hackers.
So, I think we got nothing from Huawei except teasing and should rely only
on ourselves. That is disappointing.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Bruce Momjian
On Mon, Aug 31, 2015 at 02:48:31PM -0400, Mason S wrote:
> I assume that future work around PG sharding probably would be more likely to
> be accepted with the FDW approach. One could perhaps work on pushing down
> joins, aggregates and order by, then look at any optimizations gained if code
> is moved outside of FDW.  It would make sense if some kind of generic
> optimization for foreign tables for SQL-based sources could be leveraged 
> across
> all databases, rather than having to re-implement for each FDW.
> 
> There are different approaches and related features that may need to be
> improved.
> 
> Do we want multiple copies of shards, like the pg_shard approach? Or keep
> things simpler and leave it up to the DBA to add standbys? 

I agree with all of the above.

> Do we want to leverage table inheritance? If so, we may want to spend time
> improving performance for when the number of shards becomes large with what
> currently exists. If using table inheritance, we could add the ability to
> specify what node (er, foreign server) the subtable lives on. We could create
> top level sharding expressions that allow these to be implicitly created.
> 
> Should we allow arbitrary expressions for shards, not just range, list and
> hash?
> 
> Maybe the most community-acceptable approach would look something like

I think everyone agrees that our current partitioning setup is just too
verbose and error-prone for users, and needs a simpler interface, and
one that can be better optimized internally.  I assume FDW-based
sharding will benefit from that work as well.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Horizontal scalability/sharding

2015-08-31 Thread Mason S
>
>
> We are also a bit disappointed by Huawei's position on the CSN patch,
> which we hoped to use for our XTM.
>

Disappointed in what way? Moving to some sort of CSN approach seems to open
things up for different future ideas. In the short term, it would mean
replacing potentially large snapshots and longer visibility checks. In the
long term, perhaps CSN could help simplify the design of multi-master
replication schemes.
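The snapshot-versus-CSN trade-off can be illustrated with a deliberately simplified sketch; real PostgreSQL visibility rules involve far more (subtransactions, hint bits, and so on), so treat this only as the shape of the comparison:

```python
def visible_with_snapshot(row_xid, xmin, xmax, xip):
    # Classic snapshot: ship xmin, xmax, AND the full list of in-progress
    # XIDs (xip). The list is what gets "potentially large".
    if row_xid < xmin:
        return True                   # committed before the snapshot
    if row_xid >= xmax:
        return False                  # started after the snapshot
    return row_xid not in xip         # check against the in-progress list

def visible_with_csn(row_csn, snapshot_csn):
    # CSN: a single commit-sequence number replaces the whole list.
    # A row is visible iff its transaction committed at or before the
    # snapshot's CSN; uncommitted rows have no CSN yet.
    return row_csn is not None and row_csn <= snapshot_csn
```

The single-number comparison is what makes the CSN approach attractive for distributed snapshots: only one value needs to travel between nodes.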


> FDW approach has been actively criticized by pg_shard people and that's
> also made me a bit suspicious.  It looks like  we are doomed to continue
> several development forks, so we decided to work on very important common
> project, XTM, which we hoped could be accepted by all parties and
> eventually committed to 9.6.  Now I see we were right, unfortunately.
>
>
I think the original XC project probably would have taken the FDW approach
as a basis if it had existed, with focus on push-down optimizations.

I assume that future work around PG sharding probably would be more likely
to be accepted with the FDW approach. One could perhaps work on pushing
down joins, aggregates and order by, then look at any optimizations gained
if code is moved outside of FDW.  It would make sense if some kind of
generic optimization for foreign tables for SQL-based sources could be
leveraged across all databases, rather than having to re-implement for each
FDW.

There are different approaches and related features that may need to be
improved.

Do we want multiple copies of shards, like the pg_shard approach? Or keep
things simpler and leave it up to the DBA to add standbys?

Do we want to leverage table inheritance? If so, we may want to spend time
improving performance for when the number of shards becomes large with what
currently exists. If using table inheritance, we could add the ability to
specify what node (er, foreign server) the subtable lives on. We could
create top level sharding expressions that allow these to be implicitly
created.

Should we allow arbitrary expressions for shards, not just range, list and
hash?

Maybe the most community-acceptable approach would look something like

- Use FDWs, and continue to optimize push-down operations, also for
non-PostgreSQL databases.

- Use table inheritance for defining the shards. Ideally allow for
specifying that some shards may be replicated to other foreign servers (and
itself) (for pushing down joins with lookup/static tables; at this point it
should be decent for star schema based data warehouses).

- XTM/GTM hooks. Preferably we move to CSN for snapshots in core PostgreSQL
though.

Longer term, efficient internode joins would require a lot more work.

The devil is in the details. There are things that have to be addressed,
for example, if using global XIDs via GTM, not every transaction is on
every node, so we need to make sure that new clog pages get added
properly.  There is also the potential to require a lot more code to be
added, like for cursor handling and stored functions. Perhaps some
limitations when using shards to foreign servers are acceptable if it is
desired to minimize code changes.  The XC and XL code can help here.

Regards,

Mason

