Re: [HACKERS] logical changeset generation v6.2

2013-10-29 Thread Andres Freund
On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
  There's one snag I currently can see, namely that we actually need to
  prevent a formerly dropped relfilenode from being reused. Not
  entirely sure what the best way to do that is.
 
 I'm not sure about the details, but it seems to me that this is all part of the
 same picture.  If you're tracking changed relfilenodes, you'd better
 track dropped ones as well.

What I am thinking about is the way GetNewRelFileNode() checks for
preexisting relfilenodes. It uses SnapshotDirty to scan for existing
relfilenodes for a newly created oid, which means the relfilenode of an
already-dropped relation could be reused.
I guess it could be as simple as using SatisfiesAny (or, even better, a
wrapper around SatisfiesVacuum that knows about recently dead tuples).
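
To illustrate, the collision check could look roughly like the loop below
(an untested sketch modelled on GetNewOidWithIndex(); the function name is
made up, and a real fix would probably rather use the SatisfiesVacuum-style
wrapper than plain SnapshotAny):

/*
 * Like GetNewOidWithIndex(), but scan with SnapshotAny so that tuples of
 * recently dropped relations still count as collisions and their oids
 * (and hence relfilenodes) aren't handed out again while decoding might
 * still need them.
 */
static Oid
GetNewOidNoReuse(Relation relation, Oid indexId, AttrNumber oidcolumn)
{
    Oid         newOid;
    SysScanDesc scan;
    ScanKeyData key;
    bool        collides;

    do
    {
        CHECK_FOR_INTERRUPTS();

        newOid = GetNewObjectId();

        ScanKeyInit(&key, oidcolumn,
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(newOid));

        scan = systable_beginscan(relation, indexId, true,
                                  SnapshotAny, 1, &key);
        collides = HeapTupleIsValid(systable_getnext(scan));
        systable_endscan(scan);
    } while (collides);

    return newOid;
}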

 Completely aside from this issue, what
 keeps a relation from being dropped before we've decoded all of the
 changes made to its data before the point at which it was dropped?  (I
 hope the answer isn't nothing.)

Nothing. But there's no need to prevent it, it'll still be in the
catalog and we don't ever access a non-catalog relation's data during
decoding.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-29 Thread Robert Haas
On Tue, Oct 29, 2013 at 10:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
  There's one snag I currently can see, namely that we actually need to
  prevent a formerly dropped relfilenode from being reused. Not
  entirely sure what the best way to do that is.

 I'm not sure about the details, but it seems to me that this is all part of the
 same picture.  If you're tracking changed relfilenodes, you'd better
 track dropped ones as well.

 What I am thinking about is the way GetNewRelFileNode() checks for
 preexisting relfilenodes. It uses SnapshotDirty to scan for existing
 relfilenodes for a newly created oid, which means the relfilenode of an
 already-dropped relation could be reused.
 I guess it could be as simple as using SatisfiesAny (or, even better, a
 wrapper around SatisfiesVacuum that knows about recently dead tuples).

I think modifying GetNewRelFileNode() is attacking the problem from
the wrong end.  The point is that when a table is dropped, that fact
can be communicated to the same machinery that's been tracking
the CTID->CTID mappings.  Instead of saying "hey, the tuples that were
in relfilenode 12345 are now in relfilenode 67890 in these new
positions", it can say "hey, the tuples that were in relfilenode 12345
are now GONE."
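
In other words, the unit the tracking machinery deals in could be
something like this (purely illustrative struct, not from any patch),
with a dropped relation expressed as a mapping to nowhere:

typedef struct RewriteMappingEntry
{
    TransactionId   xid;             /* xact whose (cmin,cmax) this entry maps */
    Oid             old_relfilenode;
    ItemPointerData old_ctid;        /* where the catalog tuple used to live */
    Oid             new_relfilenode; /* InvalidOid: the tuples are GONE */
    ItemPointerData new_ctid;        /* undefined if new_relfilenode is invalid */
} RewriteMappingEntry;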

 Completely aside from this issue, what
 keeps a relation from being dropped before we've decoded all of the
 changes made to its data before the point at which it was dropped?  (I
 hope the answer isn't nothing.)

 Nothing. But there's no need to prevent it, it'll still be in the
 catalog and we don't ever access a non-catalog relation's data during
 decoding.

Oh, right.  But what about a drop of a user-catalog table?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-29 Thread Andres Freund
On 2013-10-29 11:28:44 -0400, Robert Haas wrote:
 On Tue, Oct 29, 2013 at 10:47 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
   There's one snag I currently can see, namely that we actually need to
   prevent a formerly dropped relfilenode from being reused. Not
   entirely sure what the best way to do that is.
 
  I'm not sure about the details, but it seems to me that this is all part of the
  same picture.  If you're tracking changed relfilenodes, you'd better
  track dropped ones as well.
 
  What I am thinking about is the way GetNewRelFileNode() checks for
  preexisting relfilenodes. It uses SnapshotDirty to scan for existing
  relfilenodes for a newly created oid, which means the relfilenode of an
  already-dropped relation could be reused.
  I guess it could be as simple as using SatisfiesAny (or, even better, a
  wrapper around SatisfiesVacuum that knows about recently dead tuples).

 I think modifying GetNewRelFileNode() is attacking the problem from
 the wrong end.  The point is that when a table is dropped, that fact
 can be communicated to the same machinery that's been tracking
 the CTID->CTID mappings.  Instead of saying "hey, the tuples that were
 in relfilenode 12345 are now in relfilenode 67890 in these new
 positions", it can say "hey, the tuples that were in relfilenode 12345
 are now GONE."

Unfortunately I don't understand what you're suggesting. What I am
worried about is something like:

- decoding is here
VACUUM FULL pg_class; -- rewrites filenode 1 to 2
VACUUM FULL pg_class; -- rewrites filenode 2 to 3
VACUUM FULL pg_class; -- rewrites filenode 3 to 1
- now decode up to here

In this case there are two possible (cmin,cmax) values for a specific
tuple: one from the original filenode 1, and one from the version
written when filenode 3 was rewritten back into filenode 1.
Now that will only happen if there's an oid wraparound, which hopefully
shouldn't happen very often, but I'd rather not rely on that.

  Completely aside from this issue, what
  keeps a relation from being dropped before we've decoded all of the
  changes made to its data before the point at which it was dropped?  (I
  hope the answer isn't nothing.)
 
  Nothing. But there's no need to prevent it, it'll still be in the
  catalog and we don't ever access a non-catalog relation's data during
  decoding.

 Oh, right.  But what about a drop of a user-catalog table?

Currently nothing prevents that. I am not sure it's worth worrying about
it, do you think we should?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-29 Thread Robert Haas
On Tue, Oct 29, 2013 at 11:43 AM, Andres Freund and...@2ndquadrant.com wrote:
 I think modifying GetNewRelFileNode() is attacking the problem from
 the wrong end.  The point is that when a table is dropped, that fact
 can be communicated to the same machinery that's been tracking
 the CTID->CTID mappings.  Instead of saying "hey, the tuples that were
 in relfilenode 12345 are now in relfilenode 67890 in these new
 positions", it can say "hey, the tuples that were in relfilenode 12345
 are now GONE."

 Unfortunately I don't understand what you're suggesting. What I am
 worried about is something like:

 - decoding is here
 VACUUM FULL pg_class; -- rewrites filenode 1 to 2
 VACUUM FULL pg_class; -- rewrites filenode 2 to 3
 VACUUM FULL pg_class; -- rewrites filenode 3 to 1
 - now decode up to here

 In this case there are two possible (cmin,cmax) values for a specific
 tuple: one from the original filenode 1, and one from the version
 written when filenode 3 was rewritten back into filenode 1.
 Now that will only happen if there's an oid wraparound, which hopefully
 shouldn't happen very often, but I'd rather not rely on that.

Ah, OK.  I didn't properly understand the scenario you were concerned
about.  There's only a potential problem here if we get behind by more
than 4 billion relfilenodes, which seems remote, but maybe not:

http://www.pgcon.org/2013/schedule/events/595.en.html

This still seems to me to be basically an accounting problem.  At any
given time, we should *know* where the catalog tuples are located.  We
can't be decoding changes that require a given system catalog while
that system catalog is locked, so any given decoding operation happens
either before or after, not during, the rewrite of the corresponding
catalog.  As long as that VACUUM FULL operation is responsible for
updating the logical decoding metadata, we should be fine.  Any
relcache entries referencing the old relfilenode need to be
invalidated, and any CTID->[cmin,cmax] maps we're storing for those
old relfilenodes need to be invalidated, too.
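
In pseudo-code, the rewrite would end with bookkeeping along these lines
(the map-dropping call is made up; the relcache invalidation function is
the existing one in inval.c):

/* hypothetical bookkeeping at the end of a catalog-table rewrite */
static void
finish_catalog_rewrite(Oid relid, Oid old_relfilenode)
{
    /* forget any CTID->(cmin,cmax) entries kept for the old relfilenode */
    DropLogicalDecodingMapsFor(old_relfilenode);    /* made-up name */

    /* force everyone, decoding included, to re-read the pg_class entry */
    CacheInvalidateRelcacheByRelid(relid);
}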

  Completely aside from this issue, what
  keeps a relation from being dropped before we've decoded all of the
  changes made to its data before the point at which it was dropped?  (I
  hope the answer isn't nothing.)
 
  Nothing. But there's no need to prevent it, it'll still be in the
  catalog and we don't ever access a non-catalog relation's data during
  decoding.

 Oh, right.  But what about a drop of a user-catalog table?

 Currently nothing prevents that. I am not sure it's worth worrying about
 it, do you think we should?

Maybe.  Depends partly on how ugly things get if it happens, I suppose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-28 Thread Robert Haas
On Fri, Oct 25, 2013 at 7:57 AM, Andres Freund and...@2ndquadrant.com wrote:
 However, I'm leery about the idea of using a relation fork for this.
 I'm not sure whether that's what you had in mind, but it gives me the
 willies.  First, it adds distributed overhead to the system, as
 previously discussed; and second, I think the accounting may be kind
 of tricky, especially in the face of multiple rewrites.  I'd be more
 inclined to find a separate place to store the mappings.  Note that,
 AFAICS, there's no real need for the mapping file to be
 block-structured, and I believe they'll be written first (with no
 readers) and subsequently only read (with no further writes) and
 eventually deleted.

 I was thinking of storing it along with other data used during logical
 decoding and letting decoding's cleanup clean up that data as well. All the
 information for that should be there.

That seems OK.

 There's one snag I currently can see, namely that we actually need to
 prevent a formerly dropped relfilenode from being reused. Not
 entirely sure what the best way to do that is.

I'm not sure about the details, but it seems to me that this is all part of the
same picture.  If you're tracking changed relfilenodes, you'd better
track dropped ones as well.  Completely aside from this issue, what
keeps a relation from being dropped before we've decoded all of the
changes made to its data before the point at which it was dropped?  (I
hope the answer isn't nothing.)

 One possible objection to this is that it would preclude decoding on a
 standby, which seems like a likely enough thing to want to do.  So
 maybe it's best to WAL-log the changes to the mapping file so that the
 standby can reconstruct it if needed.

 The mapping file probably can be one big WAL record, so it should be
 easy enough to do.

It might be better to batch it, because if you rewrite a big relation,
and the record is really big, everyone else will be frozen out of
inserting WAL for as long as that colossal record is being written and
synced.  If it's inserted in reasonably-sized chunks, the rest of the
system won't be starved as badly.
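
A sketch of what I mean, chunking the map through the existing
XLogInsert()/XLogRecData interface (the rmgr id is real, but the info
flag is a placeholder and RewriteMappingEntry stands for whatever struct
describes one old-tuple-to-new-tuple mapping):

#define MAPPINGS_PER_RECORD 1024        /* keep each WAL record modest */

static void
log_rewrite_mappings(RewriteMappingEntry *entries, int nentries)
{
    int     done = 0;

    while (done < nentries)
    {
        int         n = Min(MAPPINGS_PER_RECORD, nentries - done);
        XLogRecData rdata;

        rdata.data = (char *) (entries + done);
        rdata.len = n * sizeof(RewriteMappingEntry);
        rdata.buffer = InvalidBuffer;
        rdata.next = NULL;

        /* one moderately sized record per chunk, so concurrent WAL
         * inserters aren't locked out for the whole rewrite */
        XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_REWRITE_MAP /* placeholder */, &rdata);

        done += n;
    }
}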

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-28 Thread Robert Haas
On Fri, Oct 25, 2013 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 So, I thought about this for some more and I think I've a partial
 solution to the problem.

 The worst thing about deadlocks that occur in the above is that they
 could be the VACUUM FULL waiting for the restart LSN[1] of a decoding
 slot to progress, but the restart LSN cannot progress because the slot
 is waiting for an xid/transaction to end which is being blocked by the
 lock upgrade from VACUUM FULL. Such conflicts are not visible to the
 deadlock detector, which obviously is bad.
 I've prototyped this (~25 lines) and this happens pretty frequently. But
 it turns out that we can actually fix this by exporting (to shared
 memory) the oldest in-progress xid of a decoding slot. Then the waiting
 code can do a XactLockTableWait() for that xid...

 I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
 that's already the case for VACUUM FULLs of system tables, although less
 likely. And it will be detected/handled.
 There's one more snag though: we currently allow CLUSTER system_table;
 in an existing transaction. I think that'd have to be disallowed.

It wouldn't bother me too much to restrict CLUSTER system_table by
PreventTransactionChain() at wal_level = logical, but obviously it
would be nicer if we *didn't* have to do that.

In general, I don't think waiting on an XID is sufficient because a
process can acquire a heavyweight lock without having an XID.  Perhaps
use the VXID instead?

One thought I had about waiting for decoding to catch up is that you
might do it before acquiring the lock.  Of course, you then have a
problem if you get behind again before acquiring the lock.  It's
tempting to adopt the solution we used for RangeVarGetRelidExtended,
namely: wait for catchup without the lock, acquire the lock, see
whether we're still caught up; if so, great; else release the lock and loop.
But there's probably too much starvation risk to get away with that.
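
In code, the pattern would be roughly this (hand-wavy sketch; the two
decoding helpers don't exist, the lock calls do):

    for (;;)
    {
        /* let all decoding slots replay up to the current insert position */
        WaitForDecodingToCatchUp(GetXLogInsertRecPtr());    /* hypothetical */

        LockRelationOid(relid, AccessExclusiveLock);

        if (DecodingStillCaughtUp())                        /* hypothetical */
            break;              /* still caught up: safe to rewrite */

        /* more catalog changes arrived while we waited for the lock;
         * release it and try again -- hence the starvation risk */
        UnlockRelationOid(relid, AccessExclusiveLock);
    }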

On the whole, I'm leaning toward thinking that the other solution
(recording the old-to-new CTID mappings generated by CLUSTER to the
extent that they are needed) is probably more elegant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-28 Thread Andres Freund
On 2013-10-28 12:04:01 -0400, Robert Haas wrote:
 On Fri, Oct 25, 2013 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:

  I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
  that's already the case for VACUUM FULLs of system tables, although less
  likely. And it will be detected/handled.
  There's one more snag though: we currently allow CLUSTER system_table;
  in an existing transaction. I think that'd have to be disallowed.
 
 It wouldn't bother me too much to restrict CLUSTER system_table by
 PreventTransactionChain() at wal_level = logical, but obviously it
 would be nicer if we *didn't* have to do that.
 
 In general, I don't think waiting on an XID is sufficient because a
 process can acquire a heavyweight lock without having an XID.  Perhaps
 use the VXID instead?

But decoding doesn't care about transactions that haven't used an XID
yet (since that means they haven't modified the catalog), so that
shouldn't be problematic.

 One thought I had about waiting for decoding to catch up is that you
 might do it before acquiring the lock.  Of course, you then have a
 problem if you get behind again before acquiring the lock.  It's
 tempting to adopt the solution we used for RangeVarGetRelidExtended,
 namely: wait for catchup without the lock, acquire the lock, see
 whether we're still caught up; if so, great; else release the lock and loop.
 But there's probably too much starvation risk to get away with that.

I think we'd pretty much always starve in that case. It'd be different
if we could detect that there weren't any writes to the table
in between. I can see doing that using a locking hack like autovac uses,
but brr, that'd be ugly.

 On the whole, I'm leaning toward thinking that the other solution
 (recording the old-to-new CTID mappings generated by CLUSTER to the
 extent that they are needed) is probably more elegant.

I personally still think that the wide cmin/cmax solution is *much*
more elegant and simpler, and that it can actually be used for things
other than logical decoding.
Since you don't seem to agree, I am going to write a prototype using such
a mapping to see how it will look, though.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-28 Thread Robert Haas
On Mon, Oct 28, 2013 at 12:17 PM, Andres Freund and...@2ndquadrant.com wrote:
 In general, I don't think waiting on an XID is sufficient because a
 process can acquire a heavyweight lock without having an XID.  Perhaps
 use the VXID instead?

 But decoding doesn't care about transactions that haven't used an XID
 yet (since that means they haven't modified the catalog), so that
 shouldn't be problematic.

Hmm, maybe.  But what if the deadlock has more members?  E.g. A is
blocking decoding by holding an AccessExclusiveLock with no XID, B is
blocking A by doing a VACUUM FULL on a rel A needs, and decoding is
blocking B.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-25 Thread Andres Freund
On 2013-10-24 10:59:21 -0400, Robert Haas wrote:
 On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
  On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   That strikes me as a flaw in the implementation rather than the idea.
   You're presupposing a patch where the necessary information is
   available in WAL yet you don't make use of it at the proper time.
  
   The problem is that the mapping would be somewhere *ahead* of the
   transaction/WAL we're currently decoding. We'd need to read ahead till
   we find the correct one.
 
  Yes, I think that's what you need to do.
 
  My problem with that is that the rewrite can be gigabytes into the future.
 
  When reading forward we could either just continue reading data into the
  reorderbuffer, but delay replaying all future commits till we found the
  currently needed remap. That might have quite the additional
  storage/memory cost, but runtime complexity should be the same as normal
  decoding.
  Or we could individually read ahead for every transaction. But doing so
  for every transaction will get rather expensive (roughly O(amount_of_wal^2)).
 
 [ Sorry it's taken me a bit of time to get back to this; other tasks
 intervened, and I also just needed some time to let it settle in my
 brain. ]

No worries. I've had enough things to work on ;)

 If you read ahead looking for a set of ctid translations from
 relfilenode A to relfilenode B, and along the way you happen to
 encounter a set of translations from relfilenode C to relfilenode D,
 you could stash that set of translations away somewhere, so that if
 the next transaction you process needs that set of mappings, it's
 already computed.  With that approach, you'd never have to pre-read
 the same set of WAL files more than once.

 But, as I think about it more, that's not very different from your
 idea of stashing the translations someplace other than WAL in the
 first place.  I mean, if the read-ahead thread generates a series of
 files in pg_somethingorother that contain those maps, you could have
 just written the maps to that directory in the first place.  So on
 further review I think we could adopt that approach.

Yea, that basically was my reasoning, only expressed much more nicely ;)

 However, I'm leery about the idea of using a relation fork for this.
 I'm not sure whether that's what you had in mind, but it gives me the
 willies.  First, it adds distributed overhead to the system, as
 previously discussed; and second, I think the accounting may be kind
 of tricky, especially in the face of multiple rewrites.  I'd be more
 inclined to find a separate place to store the mappings.  Note that,
 AFAICS, there's no real need for the mapping file to be
 block-structured, and I believe they'll be written first (with no
 readers) and subsequently only read (with no further writes) and
 eventually deleted.

I was thinking of storing it along with other data used during logical
decoding and letting decoding's cleanup clean up that data as well. All the
information for that should be there.

There's one snag I currently can see, namely that we actually need to
prevent a formerly dropped relfilenode from being reused. Not
entirely sure what the best way to do that is.

 One possible objection to this is that it would preclude decoding on a
 standby, which seems like a likely enough thing to want to do.  So
 maybe it's best to WAL-log the changes to the mapping file so that the
 standby can reconstruct it if needed.

The mapping file probably can be one big WAL record, so it should be
easy enough to do.

For a moment I thought there's a problem with decoding on the standby
having to read ahead of the current location to find the newer mapping,
but that's actually not required since we're protected by the AEL lock
during rewrites on the standby as well.

  I think that'd be pretty similar to just disallowing VACUUM
  FREEZE/CLUSTER on catalog relations since effectively it'd be too
  expensive to use.
 
 This seems unduly pessimistic to me; unless the catalogs are really
 darn big, this is a mostly theoretical problem.

Well, it's not the size of the relation, but the amount of concurrent
WAL that's being generated that matters. But anyway, if we do it like
you described above that shouldn't be a problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-25 Thread Andres Freund
On 2013-10-21 16:15:58 +0200, Andres Freund wrote:
  I don't think I understand exactly what you have in mind for (2); can
  you elaborate?  I have always thought that having a
  WaitForDecodingToCatchUp() primitive was a good way of handling
  changes that were otherwise too difficult to track our way through.  I
  am not sure you're doing that at all right now, which in some sense I
  guess is fine, but I haven't really understood your aversion to this
  solution.  There are some locking issues to be worked out here, but
  the problems don't seem altogether intractable.
 
 So, what we need to do for rewriting catalog tables would be:
 1) lock table against writes
 2) wait for all in-progress xacts to finish, they could have modified
the table in question (we don't keep locks on system tables)
 3) acquire xlog insert pointer
 4) wait for all logical decoding actions to read past that pointer
 5) upgrade the lock to an access exclusive one
 6) perform vacuum full as usual
 
 The lock upgrade hazards in here are the reason I am averse to the
 solution. And I don't see how we can avoid them, since in order for
 decoding to catchup it has to be able to read from the
 catalog... Otherwise it's easy enough to implement.

So, I thought about this for some more and I think I've a partial
solution to the problem.

The worst thing about deadlocks that occur in the above is that they
could be the VACUUM FULL waiting for the restart LSN[1] of a decoding
slot to progress, but the restart LSN cannot progress because the slot
is waiting for an xid/transaction to end which is being blocked by the
lock upgrade from VACUUM FULL. Such conflicts are not visible to the
deadlock detector, which obviously is bad.
I've prototyped this (~25 lines) and this happens pretty frequently. But
it turns out that we can actually fix this by exporting (to shared
memory) the oldest in-progress xid of a decoding slot. Then the waiting
code can do a XactLockTableWait() for that xid...
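
The waiting side would be about this much (untested sketch; the slot
fields and mutex are invented, XactLockTableWait() is the existing
single-argument primitive):

    TransactionId slot_xid;

    /* published by the decoding slot whenever its oldest in-progress
     * transaction changes; field names invented for illustration */
    SpinLockAcquire(&slot->mutex);
    slot_xid = slot->oldest_in_progress_xid;
    SpinLockRelease(&slot->mutex);

    if (TransactionIdIsValid(slot_xid))
    {
        /*
         * Sleep until that transaction ends.  Since this goes through the
         * regular lock table, the deadlock detector can see and break a
         * cycle involving the VACUUM FULL's lock upgrade.
         */
        XactLockTableWait(slot_xid);
    }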

I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
that's already the case for VACUUM FULLs of system tables, although less
likely. And it will be detected/handled.
There's one more snag though: we currently allow CLUSTER system_table;
in an existing transaction. I think that'd have to be disallowed.

What do you think?

Greetings,

Andres Freund

[1] The restart LSN is the point from where we need to be able to read
WAL to replay all changes the receiving side hasn't acked yet.

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-24 Thread Robert Haas
On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
 On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  That strikes me as a flaw in the implementation rather than the idea.
  You're presupposing a patch where the necessary information is
  available in WAL yet you don't make use of it at the proper time.
 
  The problem is that the mapping would be somewhere *ahead* of the
  transaction/WAL we're currently decoding. We'd need to read ahead till
  we find the correct one.

 Yes, I think that's what you need to do.

 My problem with that is that the rewrite can be gigabytes into the future.

 When reading forward we could either just continue reading data into the
 reorderbuffer, but delay replaying all future commits till we found the
 currently needed remap. That might have quite the additional
 storage/memory cost, but runtime complexity should be the same as normal
 decoding.
 Or we could individually read ahead for every transaction. But doing so
 for every transaction will get rather expensive (roughly O(amount_of_wal^2)).

[ Sorry it's taken me a bit of time to get back to this; other tasks
intervened, and I also just needed some time to let it settle in my
brain. ]

If you read ahead looking for a set of ctid translations from
relfilenode A to relfilenode B, and along the way you happen to
encounter a set of translations from relfilenode C to relfilenode D,
you could stash that set of translations away somewhere, so that if
the next transaction you process needs that set of mappings, it's
already computed.  With that approach, you'd never have to pre-read
the same set of WAL files more than once.

But, as I think about it more, that's not very different from your
idea of stashing the translations someplace other than WAL in the
first place.  I mean, if the read-ahead thread generates a series of
files in pg_somethingorother that contain those maps, you could have
just written the maps to that directory in the first place.  So on
further review I think we could adopt that approach.

However, I'm leery about the idea of using a relation fork for this.
I'm not sure whether that's what you had in mind, but it gives me the
willies.  First, it adds distributed overhead to the system, as
previously discussed; and second, I think the accounting may be kind
of tricky, especially in the face of multiple rewrites.  I'd be more
inclined to find a separate place to store the mappings.  Note that,
AFAICS, there's no real need for the mapping file to be
block-structured, and I believe they'll be written first (with no
readers) and subsequently only read (with no further writes) and
eventually deleted.

One possible objection to this is that it would preclude decoding on a
standby, which seems like a likely enough thing to want to do.  So
maybe it's best to WAL-log the changes to the mapping file so that the
standby can reconstruct it if needed.

 I think that'd be pretty similar to just disallowing VACUUM
 FREEZE/CLUSTER on catalog relations since effectively it'd be too
 expensive to use.

This seems unduly pessimistic to me; unless the catalogs are really
darn big, this is a mostly theoretical problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 So. As it turns out that solution isn't sufficient in the face of VACUUM
 FULL and mixed DML/DDL transactions that have not yet been decoded.

 To reiterate, as published it works like:
 For every modification of a catalog tuple (insert, multi_insert, update,
 delete) that has influence over visibility, issue a record that contains:
 * filenode
 * ctid
 * (cmin, cmax)

 When doing a visibility check on a catalog row during decoding of a mixed
 DML/DDL transaction, look up (cmin, cmax) for that row, since we don't
 store both for the tuple.

 That mostly works great.

 The problematic scenario is decoding a transaction that has done mixed
 DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
 FULL obviously changes the filenode and the ctid of a tuple, so we
 cannot successfully do a lookup based on what we logged before.

So I have a new idea for handling this problem, which seems obvious in
retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
CTID -> new CTID mappings?  This would only need to be done for
catalog tables, and maybe could be skipped for tuples whose XIDs are
old enough that we know those transactions must already be decoded.
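
Sketch of the filter I have in mind, per old catalog tuple inside the
rewrite loop (the horizon variable and the logging call are placeholders;
a real version would have to consider xmax as well):

    TransactionId xmin = HeapTupleHeaderGetXmin(old_tuple->t_data);

    /*
     * If every decoder has already replayed the transaction that created
     * this tuple, nobody can still need its (cmin,cmax); skip it.
     */
    if (TransactionIdPrecedes(xmin, logical_decoding_xmin))    /* placeholder */
        continue;

    LogRewriteMapping(old_relfilenode, &old_tuple->t_self,     /* placeholder */
                      new_relfilenode, &new_tuple->t_self);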

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
 On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:
  So. As it turns out that solution isn't sufficient in the face of VACUUM
  FULL and mixed DML/DDL transactions that have not yet been decoded.

  To reiterate, as published it works like:
  For every modification of a catalog tuple (insert, multi_insert, update,
  delete) that has influence over visibility, issue a record that contains:
  * filenode
  * ctid
  * (cmin, cmax)
 
  When doing a visibility check on a catalog row during decoding of a mixed
  DML/DDL transaction, look up (cmin, cmax) for that row, since we don't
  store both for the tuple.
 
  That mostly works great.
 
  The problematic scenario is decoding a transaction that has done mixed
  DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
  FULL obviously changes the filenode and the ctid of a tuple, so we
  cannot successfully do a lookup based on what we logged before.
 
 So I have a new idea for handling this problem, which seems obvious in
 retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
 CTID -> new CTID mappings?  This would only need to be done for
 catalog tables, and maybe could be skipped for tuples whose XIDs are
 old enough that we know those transactions must already be decoded.

Ah. If only it were so simple ;). That was my first idea, and after I'd
bragged in a 2ndQ internal chat that I'd found a simple idea I
obviously had to realize it doesn't work.

Consider:
INIT_LOGICAL_REPLICATION;
CREATE TABLE foo(...);
BEGIN;
INSERT INTO foo;
ALTER TABLE foo ...;
INSERT INTO foo;
COMMIT TX 3;
VACUUM FULL pg_class;
START_LOGICAL_REPLICATION;

When we decode tx 3 we haven't yet read the mapping from the VACUUM
FULL. That scenario can happen either because decoding was stopped for
a moment, or because decoding couldn't keep up (slow connection,
whatever).

There also can be nasty variations where the VACUUM FULL happens while a
transaction is writing data since we don't hold locks on system
relations for very long.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 11:02 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
 On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  So. As it turns out that solution isn't sufficient in the face of VACUUM
  FULL and mixed DML/DDL transactions that have not yet been decoded.

  To reiterate, as published it works like:
  For every modification of a catalog tuple (insert, multi_insert, update,
  delete) that has influence over visibility, issue a record that contains:
  * filenode
  * ctid
  * (cmin, cmax)
 
  When doing a visibility check on a catalog row during decoding of a mixed
  DML/DDL transaction, look up (cmin, cmax) for that row, since we don't
  store both for the tuple.
 
  That mostly works great.
 
  The problematic scenario is decoding a transaction that has done mixed
  DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
  FULL obviously changes the filenode and the ctid of a tuple, so we
  cannot successfully do a lookup based on what we logged before.

 So I have a new idea for handling this problem, which seems obvious in
 retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
 CTID -> new CTID mappings?  This would only need to be done for
 catalog tables, and maybe could be skipped for tuples whose XIDs are
 old enough that we know those transactions must already be decoded.

 Ah. If only it were so simple ;). That was my first idea, and after I'd
 bragged in a 2ndQ internal chat that I'd found a simple idea I
 obviously had to realize it doesn't work.

 Consider:
 INIT_LOGICAL_REPLICATION;
 CREATE TABLE foo(...);
 BEGIN;
 INSERT INTO foo;
 ALTER TABLE foo ...;
 INSERT INTO foo;
 COMMIT TX 3;
 VACUUM FULL pg_class;
 START_LOGICAL_REPLICATION;

 When we decode tx 3 we haven't yet read the mapping from the VACUUM
 FULL. That scenario can happen either because decoding was stopped for
 a moment, or because decoding couldn't keep up (slow connection,
 whatever).

That strikes me as a flaw in the implementation rather than the idea.
You're presupposing a patch where the necessary information is
available in WAL yet you don't make use of it at the proper time.  It
seems to me that you have to think of the CTID map as tied to a
relfilenode; if you try to use one relfilenode's map with a different
relfilenode, it's obviously not going to work.  So don't do that.

/me looks innocent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
 4) Store both (cmin, cmax) for catalog tuples.

BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.

Greetings,

Andres Freund




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 19:12, Andres Freund wrote:

On 2013-10-18 20:26:16 +0200, Andres Freund wrote:

4) Store both (cmin, cmax) for catalog tuples.


BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.


Parallel workers will need cmin/cmax for user tables too, to know which 
tuples are visible to the snapshot.


- Heikki




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 19:12, Andres Freund wrote:
 On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
 4) Store both (cmin, cmax) for catalog tuples.
 
 BTW: That would have the nice side-effect of delivering the basis of
 what you need to do parallel sort in a transaction that previously has
 performed DDL.
 
 Currently you cannot do anything in parallel after DDL, even if you only
 scan the table in one backend, because operators et al. have to do
 catalog lookups which you can't do consistently since cmin/cmax aren't
 available in both.
 
 Parallel workers will need cmin/cmax for user tables too, to know which
 tuples are visible to the snapshot.

The existing proposals were mostly about just parallelizing the sort and
similar operations, right? In such scenarios you really need it only for
the catalog.

But we could easily generalize it for user data too. We should even be
able to only use wide cids when a backend needs it, since inherently
it's only needed within a single transaction.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 19:23, Andres Freund wrote:

On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:

On 22.10.2013 19:12, Andres Freund wrote:

On 2013-10-18 20:26:16 +0200, Andres Freund wrote:

4) Store both (cmin, cmax) for catalog tuples.


BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.


Parallel workers will need cmin/cmax for user tables too, to know which
tuples are visible to the snapshot.


The existing proposals were mostly about just parallelizing the sort and
similar operations, right? In such scenarios you really need it only for
the catalog.

But we could easily generalize it for user data too. We should even be
able to only use wide cids when a backend needs it, since inherently
it's only needed within a single transaction.


Or just hand over a copy of the combocid map to the worker, along with 
the snapshot. Seems a lot simpler than this wide cids business..


- Heikki




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 12:25 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Or just hand over a copy of the combocid map to the worker, along with the
 snapshot. Seems a lot simpler than this wide cids business..

Yes, that's what Noah and I talked about doing.  Or possibly even
making the map into a hash table in dynamic shared memory, so that new
combo CIDs could be allocated by any backend in the parallel group.
But that seems hard, so for starters I think we'll only parallelize
read-only operations and just hand over a copy of the map.
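
For the simple hand-over, the shipped state is tiny; something like this
(illustrative only, not an actual serialization format):

typedef struct ComboCidPair
{
    CommandId   cmin;
    CommandId   cmax;
} ComboCidPair;

/*
 * What the leader would copy for each worker at startup: a combo command
 * id is just an index into this array.
 */
typedef struct SerializedComboCids
{
    int             ncombocids;
    ComboCidPair    combocids[1];   /* variable length */
} SerializedComboCids;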

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 19:25:31 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 19:23, Andres Freund wrote:
 On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 19:12, Andres Freund wrote:
 On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
 4) Store both (cmin, cmax) for catalog tuples.
 
 BTW: That would have the nice side-effect of delivering the basis of
 what you need to do parallel sort in a transaction that previously has
 performed DDL.
 
 Currently you cannot do anything in parallel after DDL, even if you only
 scan the table in one backend, because operators et al. have to do
 catalog lookups which you can't do consistently since cmin/cmax aren't
 available in both.
 
 Parallel workers will need cmin/cmax for user tables too, to know which
 tuples are visible to the snapshot.
 
 The existing proposals were mostly about just parallelizing the sort and
 similar operations, right? In such scenarios you really need it only for
 the catalog.
 
 But we could easily generalize it for user data too. We should even be
 able to only use wide cids when a backend needs it, since inherently
 it's only needed within a single transaction.
 
 Or just hand over a copy of the combocid map to the worker, along with the
 snapshot. Seems a lot simpler than this wide cids business..

That's not sufficient if you want to continue writing in the primary
backend though, which isn't an uninteresting thing to want.

I am not saying that parallel XXX is a sufficient reason for this, just
that it might be a co-benefit.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 11:59:35 -0400, Robert Haas wrote:
  So I have a new idea for handling this problem, which seems obvious in
  retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
  CTID -> new CTID mappings?  This would only need to be done for
  catalog tables, and maybe could be skipped for tuples whose XIDs are
  old enough that we know those transactions must already be decoded.
 
  Ah. If only it were so simple ;). That was my first idea, and after I'd
  bragged in a 2ndQ internal chat that I'd found a simple idea I
  obviously had to realize it doesn't work.
 
  Consider:
  INIT_LOGICAL_REPLICATION;
  CREATE TABLE foo(...);
  BEGIN;
  INSERT INTO foo;
  ALTER TABLE foo ...;
  INSERT INTO foo;
  COMMIT TX 3;
  VACUUM FULL pg_class;
  START_LOGICAL_REPLICATION;
 
  When we decode tx 3 we haven't yet read the mapping from the VACUUM
  FULL. That scenario can happen either because decoding was stopped for
  a moment, or because decoding couldn't keep up (slow connection,
  whatever).

 It seems to me that you have to think of the CTID map as tied to a
 relfilenode; if you try to use one relfilenode's map with a different
 relfilenode, it's obviously not going to work.  So don't do that.

It has to be tied to relfilenode (+ctid) *and* transaction
unfortunately.
 
 That strikes me as a flaw in the implementation rather than the idea.
 You're presupposing a patch where the necessary information is
 available in WAL yet you don't make use of it at the proper time.

The problem is that the mapping would be somewhere *ahead* of the
transaction/WAL we're currently decoding. We'd need to read ahead till
we find the correct one.
But I think I mainly misunderstood what you proposed. That mapping could
be written beside the relfilenode, instead of into the WAL. Then my
imagined problem doesn't exist anymore.

We only would need to write out mappings for tuples modified since the
xmin horizon, so it wouldn't even be *too* bad for bigger relations.

This won't easily work for two+ rewrites because we'd need to apply all
mappings in order and thus would have to keep a history of intermediate
nodes/mappings. But it'd be perfectly doable to simply wait till
decoders are caught up.
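
With two+ rewrites the lookup would have to chase a chain, roughly like
this (types and the lookup function are invented for illustration):

    /* follow filenode A -> B -> C until no further mapping exists */
    while ((m = lookup_rewrite_mapping(cur_relfilenode, &cur_ctid, xid)) != NULL)
    {
        cur_relfilenode = m->new_relfilenode;
        cur_ctid = m->new_ctid;
    }
    /* cur_relfilenode/cur_ctid now point at the tuple's current location,
     * where the logged (cmin, cmax) can finally be looked up */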

I still feel that simply storing both cmin, cmax is cleaner, but if
that's not acceptable, I can certainly live with something like this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 It seems to me that you have to think of the CTID map as tied to a
 relfilenode; if you try to use one relfilenode's map with a different
 relfilenode, it's obviously not going to work.  So don't do that.

 It has to be tied to relfilenode (+ctid) *and* transaction
 unfortunately.

I agree that it does, but it doesn't seem particularly unfortunate to me.

 That strikes me as a flaw in the implementation rather than the idea.
 You're presupposing a patch where the necessary information is
 available in WAL yet you don't make use of it at the proper time.

 The problem is that the mapping would be somewhere *ahead* of the
 transaction/WAL we're currently decoding. We'd need to read ahead till
 we find the correct one.

Yes, I think that's what you need to do.

 But I think I mainly misunderstood what you proposed. That mapping could
 be written beside the relfilenode, instead of into the WAL. Then my
 imagined problem doesn't exist anymore.

That's pretty ugly.  I think it should be written into WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
 On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote:
  That strikes me as a flaw in the implementation rather than the idea.
  You're presupposing a patch where the necessary information is
  available in WAL yet you don't make use of it at the proper time.
 
  The problem is that the mapping would be somewhere *ahead* of the
  transaction/WAL we're currently decoding. We'd need to read ahead till
  we find the correct one.
 
 Yes, I think that's what you need to do.

My problem with that is that the rewrite can be gigabytes into the future.

When reading forward we could either just continue reading data into the
reorderbuffer, but delay replaying all future commits till we found the
currently needed remap. That might have quite the additional
storage/memory cost, but runtime complexity should be the same as normal
decoding.
Or we could individually read ahead for every transaction. But doing so
for every transaction will get rather expensive (roughly O(amount_of_wal^2)).

I think that'd be pretty similar to just disallowing VACUUM
FREEZE/CLUSTER on catalog relations since effectively it'd be too
expensive to use.

  But I think I mainly misunderstood what you proposed. That mapping could
  be written beside the relfilenode, instead of into the WAL. Then my
  imagined problem doesn't exist anymore.
 
 That's pretty ugly.  I think it should be written into WAL.

It basically has O(1) access; that's why I was thinking about it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Robert Haas
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 I know of the following solutions:
 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
 2) Make VACUUM FULL prevent DDL and then wait till all changestreams
have decoded up to the current point.
 3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
if there are live decoding slots around, instead delegate that
responsibility to the slot management.
 4) Store both (cmin, cmax) for catalog tuples.

 I basically think only 1) and 4) are realistic. And 1) sucks.

 I've developed a prototype for 4) and, except for currently being incredibly
 ugly, it seems to be the most promising approach by far. My trick to
 store both cmin and cmax is to store cmax in t_hoff managed space when
 wal_level = logical.

In my opinion, (4) is too ugly to consider.  I think that if we start
playing games like this, we're opening up the doors to lots of subtle
bugs and future architectural pain that will be with us for many, many
years to come.  I believe we will bitterly regret any foray into this
area.

It has long seemed to me to be a shame that we don't have some system
for allowing old relfilenodes to stick around until they are no longer
in use.  If we had that, we might be able to allow utilities like
CLUSTER or VACUUM FULL to permit concurrent read access to the table.
I realize that what people really want is to let those things run
while allowing concurrent *write* access to the table, but a bird in
the hand is worth two in the bush.  What we're really talking about
here is applying MVCC to filesystem actions: instead of removing the
old relfilenode(s) immediately, we do it when they're no longer
referenced by anyone, just as we don't remove old tuples immediately,
but rather when they are no longer referenced by anyone.  The details
are tricky, though: we can allow write access to the *new* heap just
as soon as the rewrite is finished, but anyone who is still looking at
the *old* heap can't ever upgrade their AccessShareLock to anything
higher, or hilarity will ensue.  Also, if they lock some *other*
relation and AcceptInvalidationMessages(), their relcache entry for
the rewritten relation will get rebuilt, and that's bound to work out
poorly.  The net-net here is that I think (3) is an attractive
solution, but I don't know that we can make it work in a reasonable
amount of time.

I don't think I understand exactly what you have in mind for (2); can
you elaborate?  I have always thought that having a
WaitForDecodingToCatchUp() primitive was a good way of handling
changes that were otherwise too difficult to track our way through.  I
am not sure you're doing that at all right now, which in some sense I
guess is fine, but I haven't really understood your aversion to this
solution.  There are some locking issues to be worked out here, but
the problems don't seem altogether intractable.

(1) is basically deciding not to fix the problem.  I don't think
that's acceptable.

I don't have another idea right at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Andres Freund
On 2013-10-21 09:32:12 -0400, Robert Haas wrote:
 On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:
  I know of the following solutions:
  1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
  2) Make VACUUM FULL prevent DDL and then wait till all changestreams
 have decoded up to the current point.
  3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
 if there are live decoding slots around, instead delegate that
 responsibility to the slot management.
  4) Store both (cmin, cmax) for catalog tuples.
 
  I basically think only 1) and 4) are realistic. And 1) sucks.
 
  I've developed a prototype for 4) and, except for currently being incredibly
  ugly, it seems to be the most promising approach by far. My trick to
  store both cmin and cmax is to store cmax in t_hoff managed space when
  wal_level = logical.
 
 In my opinion, (4) is too ugly to consider.  I think that if we start
 playing games like this, we're opening up the doors to lots of subtle
 bugs and future architectural pain that will be with us for many, many
 years to come.  I believe we will bitterly regret any foray into this
 area.

Hm. After looking at the required code - which you obviously cannot have
yet - it's not actually too bad. Will post a patch implementing it later.

I don't really buy the architectural argument since originally cmin/cmax
*were* both stored. It's not something we're just inventing now. We just
optimized that away but now have discovered that's not always a good
idea and thus don't always use the optimization.

The actual decoding code shrinks by about 200 lines using this logic
which is a hint that it's not a bad idea.
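
To illustrate what goes away: with both values present, the decoding-time
check for a catalog row modified by the transaction being decoded becomes
a plain comparison instead of a logged-tuplecid lookup (accessor names
invented; the real code obviously needs more care):

    CommandId   cmin = HeapTupleHeaderGetWideCmin(tuple);   /* invented */
    CommandId   cmax = HeapTupleHeaderGetWideCmax(tuple);   /* invented */

    if (cmin >= snapshot->curcid)
        return false;       /* inserted by a later command: not yet visible */
    if (cmax != InvalidCommandId && cmax < snapshot->curcid)
        return false;       /* already deleted by an earlier command */
    return true;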

 It has long seemed to me to be a shame that we don't have some system
 for allowing old relfilenodes to stick around until they are no longer
 in use.  If we had that, we might be able to allow utilities like
 CLUSTER or VACUUM FULL to permit concurrent read access to the table.
 I realize that what people really want is to let those things run
 while allowing concurrent *write* access to the table, but a bird in
 the hand is worth two in the bush.  What we're really talking about
 here is applying MVCC to filesystem actions: instead of removing the
 old relfilenode(s) immediately, we do it when they're no longer
 referenced by anyone, just as we don't remove old tuples immediately,
 but rather when they are no longer referenced by anyone.  The details
 are tricky, though: we can allow write access to the *new* heap just
 as soon as the rewrite is finished, but anyone who is still looking at
 the *old* heap can't ever upgrade their AccessShareLock to anything
 higher, or hilarity will ensue.  Also, if they lock some *other*
 relation and AcceptInvalidationMessages(), their relcache entry for
 the rewritten relation will get rebuilt, and that's bound to work out
 poorly.  The net-net here is that I think (3) is an attractive
 solution, but I don't know that we can make it work in a reasonable
 amount of time.

I've looked at it before, and I honestly don't have a real clue how to
do it robustly.

 I don't think I understand exactly what you have in mind for (2); can
 you elaborate?  I have always thought that having a
 WaitForDecodingToCatchUp() primitive was a good way of handling
 changes that were otherwise too difficult to track our way through.  I
 am not sure you're doing that at all right now, which in some sense I
 guess is fine, but I haven't really understood your aversion to this
 solution.  There are some locking issues to be worked out here, but
 the problems don't seem altogether intractable.

So, what we need to do for rewriting catalog tables would be:
1) lock table against writes
2) wait for all in-progress xacts to finish, they could have modified
   the table in question (we don't keep locks on system tables)
3) acquire xlog insert pointer
4) wait for all logical decoding actions to read past that pointer
5) upgrade the lock to an access exclusive one
6) perform vacuum full as usual

The lock upgrade hazards in here are the reason I am averse to the
solution. And I don't see how we can avoid them, since in order for
decoding to catchup it has to be able to read from the
catalog... Otherwise it's easy enough to implement.

 (1) is basically deciding not to fix the problem.  I don't think
 that's acceptable.

I'd like to argue against this, but unfortunately I agree.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Robert Haas
On Mon, Oct 21, 2013 at 1:52 PM, Andres Freund and...@2ndquadrant.com wrote:
  In my opinion, (4) is too ugly to consider.  I think that if we start
  playing games like this, we're opening up the doors to lots of subtle
  bugs and future architectural pain that will be with us for many, many
  years to come.  I believe we will bitterly regret any foray into this
  area.

 Hm. After looking at the required code - which you obviously cannot have
 yet - it's not actually too bad. Will post a patch implementing it later.

 I don't really buy the architectural argument since originally cmin/cmax
 *were* both stored. It's not something we're just inventing now. We just
 optimized that away but now have discovered that's not always a good
 idea and thus don't always use the optimization.

 The actual decoding code shrinks by about 200 lines using this logic
 which is a hint that it's not a bad idea.

 So, here's a preliminary patch to see how this would look. It'd be great
 if you could comment on whether you still think it's a complete no-go.

 If it were for real, it'd need to be split and some minor things would
 need to get adjusted, but I think it's easier to review it seeing both
 sides at once.

I think it's a complete no-go.  Consider, e.g., the comment for
MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
update the comment and the value, I'm not inspired by the idea of
subtracting 32 from that number; even if it weren't already too small,
it would break pg_upgrade for any users who are on the edge.  Things
aren't looking too good for anything that uses HeapTupleFields,
either; consider rewrite_heap_tuple().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Andres Freund
On 2013-10-21 14:22:17 -0400, Robert Haas wrote:
 On Mon, Oct 21, 2013 at 1:52 PM, Andres Freund and...@2ndquadrant.com wrote:
   In my opinion, (4) is too ugly to consider.  I think that if we start
   playing games like this, we're opening up the doors to lots of subtle
   bugs and future architectural pain that will be with us for many, many
   years to come.  I believe we will bitterly regret any foray into this
   area.
 
  Hm. After looking at the required code - which you obviously cannot have
  yet - it's not actually too bad. Will post a patch implementing it later.
 
  I don't really buy the architectural argument since originally cmin/cmax
  *were* both stored. It's not something we're just inventing now. We just
  optimized that away but now have discovered that's not always a good
  idea and thus don't always use the optimization.
 
  The actual decoding code shrinks by about 200 lines using this logic
  which is a hint that it's not a bad idea.
 
  So, here's a preliminary patch to see how this would look. It'd be great
  if you could comment on whether you still think it's a complete no-go.
 
  If it were for real, it'd need to be split and some minor things would
  need to get adjusted, but I think it's easier to review it seeing both
  sides at once.
 
 I think it's a complete no-go.  Consider, e.g., the comment for
 MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
 update the comment and the value, I'm not inspired by the idea of
 subtracting 32 from that number; even if it weren't already too small,
 it would break pg_upgrade for any users who are on the edge.

Well, we only need to support it for (user_)catalog tables. So
pg_upgrade isn't a problem. And I don't really see a problem restricting
the number of columns for such tables.

 Things
 aren't looking too good for anything that uses HeapTupleFields,
 either; consider rewrite_heap_tuple().

Well, that currently works, by copying cmax. Since rewriting triggered
the change, I am pretty sure I've actually tested & hit that path...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Robert Haas
On Mon, Oct 21, 2013 at 2:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 I think it's a complete no-go.  Consider, e.g., the comment for
 MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
 update the comment and the value, I'm not inspired by the idea of
 subtracting 32 from that number; even if it weren't already too small,
 it would break pg_upgrade for any users who are on the edge.

 Well, we only need to support it for (user_)catalog tables. So
 pg_upgrade isn't a problem. And I don't really see a problem restricting
 the number of columns for such tables.

Inch by inch, this patch set is trying to make catalog tables more and
more different from regular tables.  I think that's a direction we're
going to regret.  I can almost believe that no great harm will come to
us from giving them two different xmin horizons, as previously
discussed, though I'm still not 100% happy about that.  Can't we have
something both be a catalog table AND replicate it?  Ick, but OK.

But changing the on disk format strikes me as crossing some sort of
bright line.  That means that you're going to have two different code
paths in a lot of important cases, one for catalog tables and one for
non-catalog tables, and the one that's only taken for catalog tables
will be rather lightly traveled.  And then you've got user catalog
tables, and the possibility that something that wasn't formerly a user
catalog table might become one, or vice versa.  Even if you can flush
out every bug that exists today, this is a recipe for future bugs.

 Things
 aren't looking too good for anything that uses HeapTupleFields,
 either; consider rewrite_heap_tuple().

 Well, that currently works, by copying cmax. Since rewriting triggered
 the change, I am pretty sure I've actually tested & hit that path...

No offense, but that's a laughable statement.  If that path works,
it's mostly if not entirely by accident.  You've fundamentally changed
the heap tuple format, and that code doesn't know about it, even
though it's deeply in bed with assumptions about the old format.  I
think this is a pretty clear indication as to what's wrong with this
approach: a lot of stuff will not care, but the stuff that does care
will be hard to find, and future incremental modifications either to
that code or to the hidden data before the t_hoff pointer could break
stuff that formerly worked.  We rejected early drafts of sepgsql RLS
cold because they changed the tuple format, and I don't see why we
shouldn't do exactly the same thing here.

But just suppose for a minute that we'd accepted that proposal and
then took this one, too.  And let's suppose we also accept the next
proposal that, like that one and this one, jams something more into
the heap tuple header.  At that point you'll have potentially as many
as 8 different maximum-number-of-attributes values for tuples, though
maybe not quite that many in practice if not all of those features can
be used together.  The macros that are needed to extract the various
values from the heap tuple will be nightmarishly complex, and we'll
have eaten up all (or more than all) of our remaining bit-space in the
infomask.  Maybe all of that sounds all right to you, but to me it
sounds like a big mess.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Andres Freund
On 2013-10-21 14:50:54 -0400, Robert Haas wrote:
 On Mon, Oct 21, 2013 at 2:27 PM, Andres Freund and...@2ndquadrant.com wrote:
  I think it's a complete no-go.  Consider, e.g., the comment for
  MaxTupleAttributeNumber, which you've blithely falsified.  Even if you
  update the comment and the value, I'm not inspired by the idea of
  subtracting 32 from that number; even if it weren't already too small,
  it would break pg_upgrade for any users who are on the edge.
 
  Well, we only need to support it for (user_)catalog tables. So
  pg_upgrade isn't a problem. And I don't really see a problem restricting
  the number of columns for such tables.
 
 Inch by inch, this patch set is trying to make catalog tables more and
 more different from regular tables.  I think that's a direction we're
 going to regret.

I can understand that.

 Can't we have something both be a catalog table AND replicate it?  Ick,
 but OK.

User catalog tables are replicated normally.

If we want we can replicate system catalog tables as well, ok, it's
about a days worth of work. I just don't see what the use case would
be and I am afraid it'd be used as a poor man's system table trigger.

 But changing the on disk format strikes me as crossing some sort of
 bright line.  That means that you're going to have two different code
 paths in a lot of important cases, one for catalog tables and one for
 non-catalog tables, and the one that's only taken for catalog tables
 will be rather lightly traveled.

But there really isn't that much difference in the code paths, is there?
If you look at it, minus the mechanical changes around bitmasks it's
really mostly just a couple of added
HeapTupleHeaderSetCmin/HeapTupleHeaderSetCmax calls + wal logging the
current cid.

 And then you've got user catalog tables, and the possibility that
 something that wasn't formerly a user catalog table might become one,
 or vice versa.

That's not really a problem - we only need cmin/cmax for catalog tables
when decoding DML that happened in the same transaction as DDL. The wide
cids could easily be removed when we freeze tuples or such.
Even something like:
BEGIN;
CREATE TABLE catalog_t(...);
INSERT INTO catalog_t ...;
UPDATE catalog_t ...
ALTER TABLE catalog_t SET (user_catalog_table = true);
INSERT INTO decoded_table (..);
INSERT INTO decoded_table (..);
UPDATE catalog_t SET ...;
INSERT INTO decoded_table (..);
COMMIT;
will work correctly.

 Even if you can flush out every bug that exists today, this is a
 recipe for future bugs.

I don't really foresee many new codepaths that care about the visibility
bits in tuple headers themselves. When was the last one of those
added/changed? I'd bet it was 9.0 with the vacuum full/cluster merge,
and three lines to be added are surely the smallest problem of that.

  Things
  aren't looking too good for anything that uses HeapTupleFields,
  either; consider rewrite_heap_tuple().
 
  Well, that currently works, by copying cmax. Since rewriting triggered
  the change, I am pretty sure I've actually tested & hit that path...
 
 No offense, but that's a laughable statement.  If that path works,
 it's mostly if not entirely by accident.  You've fundamentally changed
 the heap tuple format, and that code doesn't know about it, even
 though it's deeply in bed with assumptions about the old format.

Hm? rewrite_heap_tuple() grew code to handle that case, it's not an
accident that it works.

if (HeapTupleHeaderHasWideCid(old_tuple->t_data) &&
    HeapTupleHeaderHasWideCid(new_tuple->t_data))
{
    HeapTupleHeaderSetCmin(new_tuple->t_data,
                           HeapTupleHeaderGetCmin(old_tuple->t_data));
    HeapTupleHeaderSetCmax(new_tuple->t_data,
                           HeapTupleHeaderGetCmax(old_tuple->t_data), false);
}


Note that code handling HeapTupleHeaders outside of heapam.c, tqual.c et
al. shouldn't need to care about the changes at all.
And all the code that needs to care *already* has special-cased code
around visibility. It's relatively easy to find by grepping for
HEAP_XACT_MASK. We've so far accepted that several places need to change
if we change the visibility rules. And I don't see how we easily could
get away from that.
Patches like e.g. lsn-ranges freezing will require *gobs* more
widespread changes. And have much higher chances of breaking stuff
imo. And it's still worthwhile to do them.

 I
 think this is a pretty clear indication as to what's wrong with this
 approach: a lot of stuff will not care, but the stuff that does care
 will be hard to find, and future incremental modifications either to
 that code or to the hidden data before the t_hoff pointer could break
 stuff that formerly worked.  We rejected early drafts of sepgsql RLS
 cold because they changed the tuple format, and I don't see why we
 shouldn't do exactly the same thing here.

Ok, I have a hard time arguing against that. I either skipped or forgot
that 

Re: [HACKERS] logical changeset generation v6.2

2013-10-21 Thread Andres Freund
Hi,


On 2013-10-21 21:36:02 +0200, Andres Freund wrote:
 I think this approach would have lower maintenance overhead in
 comparison to the previous solution of handling CommandIds because it's
 actually much simpler.

Btw, I think the new approach would allow *easier* modifications for
future code caring about visibility. Since the physical location
doesn't matter anymore it'd be much more friendly towards things like
an online compacting VACUUM and such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-18 Thread Andres Freund
On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
  I thought and implemented that in the beginning. Unfortunately it's not
  enough :(. That's probably the issue that took me longest to understand
  in this patchseries...
 
  Combocids can only fix the case where a transaction actually has created
  a combocid:
 
  1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax = Invalid, cmin = 55, cmax = Invalid)
  2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax = 2, cmin = Invalid, cmax = 1)
 
  So, if we're decoding data that needs to lookup those rows in TX1 or TX2
  we both times need access to cmin and cmax, but neither transaction will
  have created a multixact. That can only be an issue in transactions with
  catalog modifications.
 
 Oh, yuck.  So that means you have to write an extra WAL record for
 EVERY heap insert, update, or delete to a catalog table?  OUCH.

So. As it turns out that solution isn't sufficient in the face of VACUUM
FULL and mixed DML/DDL transactions that have not yet been decoded.

To reiterate, as published it works like:
For every modification of a catalog tuple (insert, multi_insert, update,
delete) that has influence over visibility, issue a record (roughly as
sketched below) that contains:
* filenode
* ctid
* (cmin, cmax)
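
For concreteness, the logged record is conceptually something like the
following sketch - the struct and field names here are purely
illustrative, not the actual definitions from the patch:

/* Illustrative sketch only, not the patch's actual WAL record layout. */
#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/relfilenode.h"

typedef struct xl_catalog_tuple_cids
{
    RelFileNode     node;   /* relfilenode the catalog tuple lives in */
    ItemPointerData tid;    /* ctid of the modified tuple */
    CommandId       cmin;   /* cid of the inserting command */
    CommandId       cmax;   /* cid of the deleting command, if any */
} xl_catalog_tuple_cids;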

When doing a visibility check on a catalog row during decoding of a mixed
DML/DDL transaction, look up (cmin, cmax) for that row, since we don't
store both in the tuple itself.

That mostly works great.

The problematic scenario is decoding a transaction that has done mixed
DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
FULL obviously changes the filenode and the ctid of a tuple, so we
cannot successfully do a lookup based on what we logged before.

I know of the following solutions:
1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
2) Make VACUUM FULL prevent DDL and then wait till all changestreams
   have decoded up to the current point.
3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
   if there are live decoding slots around; instead delegate that
   responsibility to the slot management.
4) Store both (cmin, cmax) for catalog tuples.

I basically think only 1) and 4) are realistic. And 1) sucks.

I've developed a prototype for 4) and except currently being incredibly
ugly, it seems to be the most promising approach by far. My trick to
store both cmin and cmax is to store cmax in t_hoff managed space when
wal_level = logical.
That even works when changing wal_level from < logical to logical
because we only ever need to store both cmin and cmax for transactions that
have decodeable content - which they cannot yet have before wal_level =
logical.

This requires some not so nice things:
* A way to declare we're storing both. I've currently chosen
  HEAP_MOVED_OFF | HEAP_MOVED_IN. That sucks.
* A way for heap_form_tuple to know it should add the necessary space to
  t_hoff. I've added TupleDesc-tdhaswidecid for it.
* Fiddling with existing checks for HEAP_MOVED{,OFF,IN} to check for
  both set at the same time.
* Changing the WAL logging to (optionally?) transport the current
  CommandId instead of always resetting it to InvalidCommandId.

The benefits are:
* Working VACUUM FULL
* Much simpler tqual.c logic, everything is stored in the row itself. No
  hash or something like that built.
* No more need to log (relfilenode, cmin, cmax) separately from the heap
  changes themselves.

In the end, the costs are that individual catalog rows are 4 bytes
bigger iff wal_level = logical. That seems acceptable.
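
To illustrate the shape of it, using the HEAP_MOVED_OFF | HEAP_MOVED_IN
combination mentioned above as the marker - note that the accessor names
and the exact placement of the extra CommandId inside the t_hoff managed
space are just assumptions for the sake of the sketch, not necessarily
what the prototype does:

/* Illustrative sketch only. */
#include "postgres.h"
#include "access/htup_details.h"

/* both MOVED bits set at once => extra CommandId present */
#define HEAP_WIDE_CID   (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HeapTupleHeaderHasWideCid(tup) \
    (((tup)->t_infomask & HEAP_WIDE_CID) == HEAP_WIDE_CID)

/*
 * Assume heap_form_tuple reserved room for a CommandId just before the
 * start of the data area, i.e. inside the space t_hoff accounts for.
 */
#define HeapTupleHeaderGetWideCmax(tup) \
    (*(CommandId *) ((char *) (tup) + (tup)->t_hoff - sizeof(CommandId)))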

Some questions remain:
* Better idea for a flag than HEAP_MOVED_OFF | HEAP_MOVED_IN
* Should we just unconditionally log the current CommandId or make it
  conditional. We have plenty of flag space to signal whether it's
  present, but it's just 4 bytes.

Comments?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote:
 Well, I just think relying on specific symbol names in the .so file is
 kind of unfortunate.  It means that, for example, you can't have
 multiple output plugins provided by a single .so.  And in general I
 think it's something that we've tried to minimize.

 But that's not really different when you rely on _PG_init doing its
 thing, right?

Sure, that's true.  But in general I think magic symbol names aren't a
particularly good design.

 But there's only so much information available here.  Why not just
 have a format that logs it all?

 Because we do not know what all is? Also, how would we handle
 replication sets and such that all of the existing replication solutions
 have generically?

I don't see how you can fail to know what all is.  There's only a
certain set of facts available.  I mean you could log irrelevant crap
like a random number that you just picked or the sum of all numeric
values in the column, but nobody's likely to want that.  What people
are going to want is the operation performed (insert, update, or
delete), all the values in the new tuple, the key values from the old
tuple, the transaction ID, and maybe some meta-information about the
transaction (such as the commit timestamp).  What I'd probably do is
emit the data in CSV format, with the first column of each line being
a single character indicating what sort of row this is: H means a
header row, defining the format of subsequent rows
(H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
a new header row is emitted only when the column list changes); I, U,
or D means an insert, update, or delete, with column 2 being the
transaction ID, column 3 being the table name, and the remaining
columns matching the last header row emitted for that table, T
means meta-information about a transaction, whatever we have (e.g.
T,txn_id,commit_time).  There's probably some further tweaking of that
that could be done, and I might be overlooking some salient details,
like maybe we want to indicate the column types as well as their
names, but the range of things that someone can want to do here is not
unlimited.  The point, for me anyway, is that someone can write a
crappy Perl script to apply changes from a file like this in a day.
My contention is that there are a lot of people who will want to do
just that, for one reason or another.  The plugin interface has
awesome power and flexibility, and really high-performance replication
solutions will really benefit from that.  But regular people don't
want to write C code; they just want to write a crappy Perl script.
And I think we can facilitate that without too much work.
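
Just to make that concrete, the output for a toy table might look
something like this (table and values made up, purely illustrative):

H,accounts,id,balance,id
I,1001,accounts,17,100.00,
U,1002,accounts,17,250.00,17
D,1003,accounts,,,17
T,1003,2013-10-15 14:21:07+00

The exact values don't matter; the point is that a few lines of any
scripting language can split this on commas and apply it.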

 Oh, yuck.  So that means you have to write an extra WAL record for
 EVERY heap insert, update, or delete to a catalog table?  OUCH.

 Yes. We could integrate it into the main record without too many
 problems, but it didn't seem like an important optimization and it would
 have higher chances of slowing down wal_level < logical.

Hmm.  I don't know whether that's an important optimization or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Mon, Oct 14, 2013 at 5:07 PM, Andres Freund and...@2ndquadrant.com wrote:
 So, see the attached benchmark script. I've always done both a disk-bound
 and a memory-bound (using eatmydata, preventing fsyncs) run.

 * unpatched run, wal_level = hot_standby, eatmydata
 * unpatched run, wal_level = hot_standby

 * patched run, wal_level = hot_standby, eatmydata
 * patched run, wal_level = hot_standby

 * patched run, wal_level = logical, eatmydata
 * patched run, wal_level = logical

 Based on those results, there's no difference above noise for
 wal_level=hot_standby, with or without the patch. With wal_level=logical
 there's a measurable increase in wal traffic (~12-17%), but no
 performance decrease above noise.

 From my POV that's ok, those are really crazy catalog workloads.

Any increase in WAL traffic will translate into a performance hit once
the I/O channel becomes saturated, but I agree those numbers don't
sound terrible for that fairly brutal test case.  Actually, I was more
concerned about the hit on non-catalog workloads.  pgbench isn't a
good test because the key column is so narrow; but suppose we have a
table like (a text, b integer, c text) where (a, c) is the primary key
and those strings are typically pretty long - say just short enough
that we can still index the column.  It'd be worth testing both
workloads where the primary key doesn't change (so the only overhead
is figuring out that we need not log it) and those where it does
(where we're double-logging most of the tuple).  I assume the latter
has to produce a significant hit to WAL volume, and I don't think
there's much we can do about that; but the former had better be nearly
free.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 08:42:20 -0400, Robert Haas wrote:
 On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote:
  Well, I just think relying on specific symbol names in the .so file is
  kind of unfortunate.  It means that, for example, you can't have
  multiple output plugins provided by a single .so.  And in general I
  think it's something that we've tried to minimize.
 
  But that's not really different when you rely on _PG_init doing its
  thing, right?
 
 Sure, that's true.  But in general I think magic symbol names aren't a
 particularly good design.

It allows you to use the shared library both as a normal extension loaded
via shared_preload_libraries or ad hoc and as an output plugin, which seems
like a sensible goal.
We could have a single _PG_init_output_plugin() symbol that fills in
such a struct which would then not conflict with using the .so
independently. If you prefer that I'll change things around.

We can't do something like 'output_plugin_in_progress' before calling
_PG_init() because _PG_init() won't be called again if the shared object
is already loaded...
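
Just so we're talking about the same thing, the _PG_init_output_plugin()
idea would look roughly like the sketch below - the struct and callback
signatures are invented for illustration, they're not what the patchset
actually defines:

/* Illustrative sketch only, not the patchset's actual definitions. */
#include "postgres.h"

typedef struct OutputPluginCallbacks
{
    void    (*begin_cb) (void *private_data, TransactionId xid);
    void    (*change_cb) (void *private_data, const char *relname);
    void    (*commit_cb) (void *private_data, TransactionId xid);
} OutputPluginCallbacks;

static void my_begin(void *private_data, TransactionId xid) {}
static void my_change(void *private_data, const char *relname) {}
static void my_commit(void *private_data, TransactionId xid) {}

/* the single well-known symbol the walsender would look up in the .so */
void
_PG_init_output_plugin(OutputPluginCallbacks *cb)
{
    cb->begin_cb = my_begin;
    cb->change_cb = my_change;
    cb->commit_cb = my_commit;
}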

  But there's only so much information available here.  Why not just
  have a format that logs it all?
 
  Because we do not know what all is? Also, how would we handle
  replication sets and such that all of the existing replication solutions
  have generically?
 
 I don't see how you can fail to know what all is.  There's only a
 certain set of facts available.  I mean you could log irrelevant crap
 like a random number that you just picked or the sum of all numeric
 values in the column, but nobody's likely to want that.  What people
 are going to want is the operation performed (insert, update, or
 delete), all the values in the new tuple, the key values from the old
 tuple, the transaction ID, and maybe some meta-information about the
 transaction (such as the commit timestamp).

Some will want all column names included because that makes replication
into different schemas/databases easier, others won't because it makes
replicating the data more complicated and expensive.
Lots will want the primary key as a separate set of columns even for
inserts, others not.
There's also the question of datatypes of values and null representation.

 What I'd probably do is
 emit the data in CSV format, with the first column of each line being
 a single character indicating what sort of row this is: H means a
 header row, defining the format of subsequent rows
 (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
 a new header row is emitted only when the column list changes); I, U,
 or D means an insert, update, or delete, with column 2 being the
 transaction ID, column 3 being the table name, and the remaining
 columns matching the last header row emitted for that table, T
 means meta-information about a transaction, whatever we have (e.g.
 T,txn_id,commit_time).

There's two issues I have with this:
a) CSV seems like a bad format for this. If a transaction inserts into
multiple tables the number of columns will constantly change. Many CSV
parsers don't deal with that all too gracefully. E.g. you can't even
load the data into another postgres database as an audit log.

If we go for CSV I think we should put the entire primary key as one
column (containing all the columns) and the entire row another.

We also don't have any nice facilities for actually writing CSV - so
we'll need to start extracting escaping code from COPY. In the end all
that will make the output plugin very hard to use as an example because
the code will get more complicated.

b) Emitting new row descriptors every time the schema changes will
require keeping track of the schema. I think that won't be trivial. It
also makes consumption of the data more complicated in comparison to
including the description with every row.

Both are even more true once we extend the format to support streaming
of transactions while they are performed.

 But regular people don't want to write C code; they just want to write
 a crappy Perl script.  And I think we can facilitate that without too
 much work.

I think the generic output plugin should be a separate one from the
example one (which is the one included in the patchset).

  Oh, yuck.  So that means you have to write an extra WAL record for
  EVERY heap insert, update, or delete to a catalog table?  OUCH.
 
  Yes. We could integrate it into the main record without too many
  problems, but it didn't seem like an important optimization and it would
  have higher chances of slowing down wal_level < logical.
 
 Hmm.  I don't know whether that's an important optimization or not.

Based on the benchmark I'd say no. If we discover we need to go there we
can do so later. I don't foresee this being really problematic.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list 

Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 08:49:26 -0400, Robert Haas wrote:
 On Mon, Oct 14, 2013 at 5:07 PM, Andres Freund and...@2ndquadrant.com wrote:
  So, see the attached benchmark script. I've always done both a disk-bound
  and a memory-bound (using eatmydata, preventing fsyncs) run.
 
  * unpatched run, wal_level = hot_standby, eatmydata
  * unpatched run, wal_level = hot_standby
 
  * patched run, wal_level = hot_standby, eatmydata
  * patched run, wal_level = hot_standby
 
  * patched run, wal_level = logical, eatmydata
  * patched run, wal_level = logical
 
  Based on those results, there's no difference above noise for
  wal_level=hot_standby, with or without the patch. With wal_level=logical
  there's a measurable increase in wal traffic (~12-17%), but no
  performance decrease above noise.
 
  From my POV that's ok, those are really crazy catalog workloads.
 
 Any increase in WAL traffic will translate into a performance hit once
 the I/O channel becomes saturated, but I agree those numbers don't
 sound terrible for that fairly brutal test case.

Well, the parallel workloads were fsync saturated, although probably not
throughput saturated; that's why I added them. But yes, it's not the same as a
throughput saturated IO channel.
Probably the worst case real-world workload is one that uses lots and
lots of ON COMMIT DROP temporary tables.

 Actually, I was more concerned about the hit on non-catalog workloads.  
 pgbench isn't a
 good test because the key column is so narrow; but suppose we have a
 table like (a text, b integer, c text) where (a, c) is the primary key
 and those strings are typically pretty long - say just short enough
 that we can still index the column.  It'd be worth testing both
 workloads where the primary key doesn't change (so the only overhead
 is figuring out that we need not log it) and those where it does
 (where we're double-logging most of the tuple).  I assume the latter
 has to produce a significant hit to WAL volume, and I don't think
 there's much we can do about that; but the former had better be nearly
 free.

Ah, ok. Then I misunderstood you.

Is there a specific overhead you are afraid of in the
pkey-doesn't-change scenario? The changed wal logging (buffer in a
separate rdata entry) or the check whether the primary key has changed?

The only way I have been able to measure differences in that scenario
was to load a table with a low fillfactor and wide tuples, checkpoint,
and then update lots of rows. On wal_level=logical that will result in
full-page-images and tuple data being logged which can be noticeable if
you have really large tuples, even if the pkey doesn't change.

We could optimize that by not actually logging the tuple data in that
case but just including the tid so we could extract things from the Bkp
block ourselves. But that will complicate the code and doesn't yet seem
warranted.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
 If we go for CSV I think we should put the entire primary key as one
 column (containing all the columns) and the entire row another.

What about columns like:
* action B|I|U|D|C

* xid
* timestamp

* tablename

* key name
* key column names
* key column types

* new key column values

* column names
* column types
* column values

* candidate_key_changed?
* old key column values

And have output plugin options
* include-column-types
* include-column-names
* include-primary-key

If something isn't included it's simply left out.
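
Purely as an illustration - and deliberately hand-waving the
separator/escaping question below by using {} to group the multi-valued
columns - an UPDATE with all options enabled might come out roughly as:

U,1002,2013-10-15 14:21:07+00,public.accounts,accounts_pkey,{id},{int4},{17},{id,balance},{int4,numeric},{17,250.00},f,{}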

What still needs to be determined is:
* how do we separate and escape multiple values in one CSV column
* how do we represent NULLs

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund and...@2ndquadrant.com wrote:
 It allows you to use the shared library both as a normal extension loaded
 via shared_preload_libraries or ad hoc and as an output plugin, which seems
 like a sensible goal.
 We could have a single _PG_init_output_plugin() symbol that fills in
 such a struct which would then not conflict with using the .so
 independently. If you prefer that I'll change things around.

I think part of the problem may be that you're using the library name
to identify the output plugin.  I'm not excited about that design.
For functions, you give the function a name and that is a pointer to
where to actually find the function, which may be a 2-tuple
<library-name, function-name>, or perhaps just a 1-tuple
<builtin-function-name>, or maybe the whole text of a PL/pgsql
procedure that should be compiled.

Perhaps this ought to work similarly.  Create a function in pg_proc
which returns the structure containing the function pointers.  Then,
when that output plugin is selected, it'll automatically trigger
loading the correct shared library if that's needed; and the shared
library name may (but need not) match the output plugin name.

 What I'd probably do is
 emit the data in CSV format, with the first column of each line being
 a single character indicating what sort of row this is: H means a
 header row, defining the format of subsequent rows
 (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
 a new header row is emitted only when the column list changes); I, U,
 or D means an insert, update, or delete, with column 2 being the
 transaction ID, column 3 being the table name, and the remaining
 columns matching the last header row emitted for that table, T
 means meta-information about a transaction, whatever we have (e.g.
 T,txn_id,commit_time).

 There's two issues I have with this:
 a) CSV seems like a bad format for this. If a transaction inserts into
 multiple tables the number of columns will constantly change. Many CSV
 parsers don't deal with that all too gracefully. E.g. you can't even
 load the data into another postgres database as an audit log.

We can pick some other separator.  I don't think ragged CSV is a big
problem; I'm actually more worried about having an easy way to handle
embedded commas and newlines and so on.  But I'd be fine with
tab-separated data or something too, if you think that's better.  What
I want is something that someone can parse with a script that can be
written in a reasonable amount of time in their favorite scripting
language.  I predict that if we provide something like this we'll
vastly expand the number of users who can make use of this new
functionality.

User: So, what's new in PostgreSQL 9.4?
Hacker: Well, now we have logical replication!
User: Why is that cool?
Hacker: Well, streaming replication is awesome for HA, but it has
significant limitations.  And trigger-based systems are very mature,
but the overhead is high and their lack of core integration makes them
hard to use.  With this technology, you can build systems that will
replicate individual tables or even parts of tables, multi-master
systems, and lots of other cool stuff.
User: Wow, that sounds great.  How do I use it?
Hacker: Well, first you write an output plugin in C using a special API.
User: Hey, do you know whether the MongoDB guys came to this conference?

Let's try that again.

User: Wow, that sounds great.  How do I use it?
Hacker: Well, currently, the output gets dumped as a series of text
files that are designed to be parsed using a scripting language.  We
have sample parsers written in Perl and Python that you can use as-is
or hack up to meet your needs.

Now, some users are still going to head for the hills.  But at least
from where I sit it sounds a hell of a lot better than the first
answer.  We're not going to solve all of the tooling problems around
this technology in one release, for sure.  But as far as 95% of our
users are concerned, a C API might as well not exist at all.  People
WILL try to machine parse the output of whatever demo plugins we
provide; so I think we should try hard to provide at least one such
plugin that is designed to make that as easy as possible.

 If we go for CSV I think we should put the entire primary key as one
 column (containing all the columns) and the entire row another.

 We also don't have any nice facilities for actually writing CSV - so
 we'll need to start extracting escaping code from COPY. In the end all
 that will make the output plugin very hard to use as an example because
 the code will get more complicated.

 b) Emitting new row descriptors every time the schema changes will
 require keeping track of the schema. I think that won't be trivial. It
 also makes consumption of the data more complicated in comparison to
 including the description with every row.

 Both are even more true once we extend the format to support streaming
 of transactions while they are 

Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Hannu Krosing
On 10/15/2013 01:42 PM, Robert Haas wrote:
 On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote:
 Well, I just think relying on specific symbol names in the .so file is
 kind of unfortunate.  It means that, for example, you can't have
 multiple output plugins provided by a single .so.  And in general I
 think it's something that we've tried to minimize.
 But that's not really different when you rely on _PG_init doing its
 thing, right?
 Sure, that's true.  But in general I think magic symbol names aren't a
 particularly good design.

 But there's only so much information available here.  Why not just
 have a format that logs it all?
 Because we do not know what all is? Also, how would we handle
 replication sets and such that all of the existing replication solutions
 have generically?
 I don't see how you can fail to know what all is.  
We instinctively know what all is - as in the famous case of a Buddhist
ordering a hamburger - Make me All wit Everything :) - but the
requirements of different replication systems vary wildly.

 ...
 What people
 are going to want is the operation performed (insert, update, or
 delete), all the values in the new tuple, the key values from the old
 tuple, 
For multi-master / conflict resolution you may also want all old
values to make sure that they have not changed on the target.

The difference in WAL volume can be really significant, especially
in the case of DELETE, where there are no new columns.

For some forms of conflict resolution we may even want to know
the database user who initiated the operation, and possibly even
some session variables like very_important=yes.

 ... The point, for me anyway, is that someone can write a
 crappy Perl script to apply changes from a file like this in a day.
 My contention is that there are a lot of people who will want to do
 just that, for one reason or another.  The plugin interface has
 awesome power and flexibility, and really high-performance replication
 solutions will really benefit from that.  But regular people don't
 want to write C code; they just want to write a crappy Perl script.
 And I think we can facilitate that without too much work.
just provide a to-csv or to-json plugin and the crappy perl guys
are happy.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Hannu Krosing
On 10/15/2013 02:47 PM, Andres Freund wrote:
 On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
 If we go for CSV I think we should put the entire primary key as one
 column (containing all the columns) and the entire row another.
just use JSON :)
 What about columns like:
 * action B|I|U|D|C

 * xid
 * timestamp

 * tablename

 * key name
 * key column names
 * key column types

 * new key column values

 * column names
 * column types
 * column values

 * candidate_key_changed?
 * old key column values

 And have output plugin options
 * include-column-types
 * include-column-names
 * include-primary-key

 If something isn't included it's simply left out.

 What still need to be determined is:
 * how do we separate and escape multiple values in one CSV column
 * how do we represent NULLs

Or borrow whatever is possible from pg_dump, as it has needed to solve
most of the same problems already, and consistency is good in general.

 Greetings,

 Andres Freund




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Tue, Oct 15, 2013 at 9:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
 If we go for CSV I think we should put the entire primary key as one
 column (containing all the columns) and the entire row another.

 What about columns like:
 * action B|I|U|D|C

BEGIN and COMMIT?

 * xid
 * timestamp

 * tablename

 * key name
 * key column names
 * key column types

 * new key column values

 * column names
 * column types
 * column values

 * candidate_key_changed?
 * old key column values

Repeating the column names for every row strikes me as a nonstarter.
If the plugin interface isn't rich enough to provide a convenient way
to avoid that, then it needs to be fixed so that it is, because it
will be a common requirement.  Sure, some people may want JSON or XML
output that reiterates the labels every time, but for a lot of people
that's going to greatly increase the size of the output and be
undesirable for that reason.

 What still need to be determined is:
 * how do we separate and escape multiple values in one CSV column
 * how do we represent NULLs

I consider the escaping a key design decision.  Ideally, it should be
something that's easy to reverse from a scripting language; ideally
also, it should be something similar to how we handle COPY.  These
goals may be in conflict; we'll have to pick something.

I'm not sure that having multiple values in one column is a good plan,
because now you need multiple levels of parsing to unpack the row.
I'd rather just have a flat column list with a key somewhere
explaining how to interpret the data.  But I'm prepared to give in on
that point so long as we can demonstrate that the format can be easily
parsed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Tue, Oct 15, 2013 at 10:09 AM, Hannu Krosing ha...@krosing.net wrote:
 I don't see how you can fail to know what all is.
 We instinctively know what all is - as in the famous case of a Buddhist
 ordering a hamburger - Make me All wit Everything :) - but the
 requirements of different replication systems vary wildly.

That's true to some degree, but let's not exaggerate the degree to
which it is true.

 For multi-master / conflict resolution you may also want all old
 values to make sure that they have not changed on target.

The patch as proposed doesn't make that information available.  If you
want that to be an option, now would be the right time to argue for
it.

 for some forms of conflict resolution we may even want to know
 the database user who initiated the operation. and possibly even
 some session variables like very_important=yes.

Well, if you have requirements like logging very_important=yes, then
you're definitely into the territory where you need your own output
plugin.  I have no problem telling people who want that sort of thing
that they've got to go write C code.  What I'm trying to do, as Larry
Wall once said, is to make simple things simple and hard things
possible.  The output plugin interface accomplishes the latter, but,
by itself, not the former.

 And I think we can facilitate that without too much work.
 just provide a to-csv or to-json plugin and the crappy perl guys
 are happy.

Yep, that's exactly what I'm advocating for.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 10:20:55 -0400, Robert Haas wrote:
  For multi-master / conflict resolution you may also want all old
  values to make sure that they have not changed on target.
 
 The patch as proposed doesn't make that information available.  If you
 want that to be an option, now would be the right time to argue for
 it.

I don't think you necessarily want it for most MM solutions, but I agree
it will be useful for some scenarios.

I think the ReorderBufferChange struct needs a better way to distinguish
between old-key and old-tuple now, but I'd rather implement the
facility for logging the full old tuple in a separate patch. The
patchset is big enough as is, let's not tack on more features.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 10:09:05 -0400, Robert Haas wrote:
 On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund and...@2ndquadrant.com wrote:
  It allows you to use the shared library both as a normal extension loaded
  via shared_preload_libraries or ad hoc and as an output plugin, which seems
  like a sensible goal.
  We could have a single _PG_init_output_plugin() symbol that fills in
  such a struct which would then not conflict with using the .so
  independently. If you prefer that I'll change things around.
 
 I think part of the problem may be that you're using the library name
 to identify the output plugin.  I'm not excited about that design.
 For functions, you give the function a name and that is a pointer to
 where to actually find the function, which may be a 2-tuple
 <library-name, function-name>, or perhaps just a 1-tuple
 <builtin-function-name>, or maybe the whole text of a PL/pgsql
 procedure that should be compiled.

That means you allow trivial remote code execution since you could try
to load system() or something else that's available in every shared
object. Now you can argue that that's OK since we have special checks
for replication connections, but I'd rather not go there.

 Perhaps this ought to work similarly.  Create a function in pg_proc
 which returns the structure containing the function pointers.  Then,
 when that output plugin is selected, it'll automatically trigger
 loading the correct shared library if that's needed; and the shared
 library name may (but need not) match the output plugin name.

I'd like to avoid relying on inserting stuff into pg_proc because that
makes it harder to extract WAL from a HS standby. Requiring to configure
that on the primary to extract data on the standby seems confusing to
me.

But perhaps that's the correct solution :/

 Now, some users are still going to head for the hills.  But at least
 from where I sit it sounds a hell of a lot better than the first
 answer.  We're not going to solve all of the tooling problems around
 this technology in one release, for sure.  But as far as 95% of our
 users are concerned, a C API might as well not exist at all.  People
 WILL try to machine parse the output of whatever demo plugins we
 provide; so I think we should try hard to provide at least one such
 plugin that is designed to make that as easy as possible.

Well, just providing the C API + an example in a first step didn't work
out too badly for FDWs. I am pretty sure that once released there will
soon be extensions for it on PGXN or whatever for special use cases.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Tue, Oct 15, 2013 at 10:27 AM, Andres Freund and...@2ndquadrant.com wrote:
 I think part of the problem may be that you're using the library name
 to identify the output plugin.  I'm not excited about that design.
 For functions, you give the function a name and that is a pointer to
 where to actually find the function, which may be a 2-tuple
 <library-name, function-name>, or perhaps just a 1-tuple
 <builtin-function-name>, or maybe the whole text of a PL/pgsql
 procedure that should be compiled.

 That means you allow trivial remote code execution since you could try
 to load system() or something else that's available in every shared
 object. Now you can argue that that's OK since we have special checks
 for replication connections, but I'd rather not go there.

Well, obviously you can't let somebody load any library they want.
But that's pretty much true anyway; LOAD had better be confined to
superusers unless there is something (like a pg_proc entry) that
provides prior authorization for that specific load.

 Perhaps this ought to work similarly.  Create a function in pg_proc
 which returns the structure containing the function pointers.  Then,
 when that output plugin is selected, it'll automatically trigger
 loading the correct shared library if that's needed; and the shared
 library name may (but need not) match the output plugin name.

 I'd like to avoid relying on inserting stuff into pg_proc because that
 makes it harder to extract WAL from a HS standby. Requiring to configure
 that on the primary to extract data on the standby seems confusing to
 me.

 But perhaps that's the correct solution :/

That's a reasonable concern.  I don't have another idea at the moment,
unless we want to allow replication connections to issue LOAD
commands.  Then you can LOAD the library, so that the plug-in is
registered under the well-known name you expect it to have, and then
use that name to start replication.

 Now, some users are still going to head for the hills.  But at least
 from where I sit it sounds a hell of a lot better than the first
 answer.  We're not going to solve all of the tooling problems around
 this technology in one release, for sure.  But as far as 95% of our
 users are concerned, a C API might as well not exist at all.  People
 WILL try to machine parse the output of whatever demo plugins we
 provide; so I think we should try hard to provide at least one such
 plugin that is designed to make that as easy as possible.

 Well, just providing the C API + an example in a first step didn't work
 out too badly for FDWs. I am pretty sure that once released there will
 soon be extensions for it on PGXN or whatever for special usecases.

I suspect so, too.  But I also think that if that's the only thing
available in the first release, a lot of users will get a poor initial
impression.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 10:15:14 -0400, Robert Haas wrote:
 On Tue, Oct 15, 2013 at 9:47 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-10-15 15:17:58 +0200, Andres Freund wrote:
  If we go for CSV I think we should put the entire primary key as one
  column (containing all the columns) and the entire row another.
 
  What about columns like:
  * action B|I|U|D|C
 
 BEGIN and COMMIT?

That's B and C, yes. You'd rather not have them? When would you replay
the commit without an explicit message telling you to?

 Repeating the column names for every row strikes me as a nonstarter.
 [...]
 Sure, some people may want JSON or XML
 output that reiterates the labels every time, but for a lot of people
 that's going to greatly increase the size of the output and be
 undesirable for that reason.

But I argue that most simpler users - which are exactly the ones a
generic output plugin is aimed at - will want all column names since it
makes replay far easier.

 If the plugin interface isn't rich enough to provide a convenient way
 to avoid that, then it needs to be fixed so that it is, because it
 will be a common requirement.

Oh, it surely is possible to avoid repeating it. The output plugin
interface simply gives you a relcache entry that contains everything
necessary.
The output plugin would need to keep track of whether it has output data
for a specific relation and it would need to check whether the table
definition has changed, but I don't see how we could avoid that?

  What still need to be determined is:
  * how do we separate and escape multiple values in one CSV column
  * how do we represent NULLs
 
 I consider the escaping a key design decision.  Ideally, it should be
 something that's easy to reverse from a scripting language; ideally
 also, it should be something similar to how we handle COPY.  These
 goals may be in conflict; we'll have to pick something.

Note that parsing COPY output is a major PITA from most languages...

Perhaps we should make the default output JSON instead, with every
action terminated by a null byte?
That's probably easier to parse from various scripting languages than
anything else.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 10:34:53 -0400, Robert Haas wrote:
 On Tue, Oct 15, 2013 at 10:27 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I think part of the problem may be that you're using the library name
  to identify the output plugin.  I'm not excited about that design.
  For functions, you give the function a name and that is a pointer to
  where to actually find the function, which may be a 2-tuple
  <library-name, function-name>, or perhaps just a 1-tuple
  <builtin-function-name>, or maybe the whole text of a PL/pgsql
  procedure that should be compiled.
 
  That means you allow trivial remote code execution since you could try
  to load system() or something else that's available in every shared
  object. Now you can argue that that's OK since we have special checks
  for replication connections, but I'd rather not go there.
 
 Well, obviously you can't let somebody load any library they want.
 But that's pretty much true anyway; LOAD had better be confined to
 superusers unless there is something (like a pg_proc entry) that
 provides prior authorization for that specific load.

Currently you can create users that have permissions for replication but
which are not superusers. I think we should strive to provide that
capability for changeset extraction as well.

  Perhaps this ought to work similarly.  Create a function in pg_proc
  which returns the structure containing the function pointers.  Then,
  when that output plugin is selected, it'll automatically trigger
  loading the correct shared library if that's needed; and the shared
  library name may (but need not) match the output plugin name.
 
  I'd like to avoid relying on inserting stuff into pg_proc because that
  makes it harder to extract WAL from a HS standby. Requiring to configure
  that on the primary to extract data on the standby seems confusing to
  me.
 
  But perhaps that's the correct solution :/
 
 That's a reasonable concern.  I don't have another idea at the moment,
 unless we want to allow replication connections to issue LOAD
 commands.  Then you can LOAD the library, so that the plug-in is
 registered under the well-known name you expect it to have, and then
 use that name to start replication.

But what's the advantage of that over the current situation or one where
PG_load_output_plugin() is called? The current and related
implementations allow you to only load libraries in some designated
postgres directories and it doesn't allow you to call any arbitrary
functions in there.

Would you be content with a symbol PG_load_output_plugin being called
that fills out the actual callbacks?
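
To make that concrete, a hypothetical sketch of the symbol-based approach; the struct and callback names below are invented for illustration and are not the interface the patch currently implements:

/* Hypothetical sketch only: the "one well-known symbol fills in a callback
 * struct" model being discussed.  All names here are invented. */
typedef struct OutputPluginCallbacks
{
    void        (*begin_cb) (void *private_data);
    void        (*change_cb) (void *private_data, const char *change);
    void        (*commit_cb) (void *private_data);
} OutputPluginCallbacks;

static void my_begin(void *private_data) { /* emit BEGIN */ }
static void my_change(void *private_data, const char *change) { /* emit row */ }
static void my_commit(void *private_data) { /* emit COMMIT */ }

/*
 * Core would dlsym() exactly this symbol from whatever shared object the
 * user named when starting replication, and call it once to obtain the
 * callbacks -- no other symbol in the library is ever called directly.
 */
void
PG_load_output_plugin(OutputPluginCallbacks *cb)
{
    cb->begin_cb = my_begin;
    cb->change_cb = my_change;
    cb->commit_cb = my_commit;
}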

  Now, some users are still going to head for the hills.  But at least
  from where I sit it sounds a hell of a lot better than the first
  answer.  We're not going to solve all of the tooling problems around
  this technology in one release, for sure.  But as far as 95% of our
  users are concerned, a C API might as well not exist at all.  People
  WILL try to machine parse the output of whatever demo plugins we
  provide; so I think we should try hard to provide at least one such
  plugin that is designed to make that as easy as possible.
 
  Well, just providing the C API + an example in a first step didn't work
  out too badly for FDWs. I am pretty sure that once released there will
  soon be extensions for it on PGXN or whatever for special usecases.
 
 I suspect so, too.  But I also think that if that's the only thing
 available in the first release, a lot of users will get a poor initial
 impression.

I think lots of people will expect a builtin logical replication
solution :/. Which seems a tad unlikely to arrive in 9.4.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Tue, Oct 15, 2013 at 10:56 AM, Andres Freund and...@2ndquadrant.com wrote:
  That means you allow trivial remote code execution since you could try
  to load system() or something else that's available in every shared
  object. Now you can argue that that's OK since we have special checks
  for replication connections, but I'd rather not go there.

 Well, obviously you can't let somebody load any library they want.
 But that's pretty much true anyway; LOAD had better be confined to
 superusers unless there is something (like a pg_proc entry) that
 provides prior authorization for that specific load.

 Currently you can create users that have permissions for replication but
 which are not superusers. I think we should strive to provide that
 capability for changeset extraction as well.

I agree.

  Perhaps this ought to work similarly.  Create a function in pg_proc
  which returns the structure containing the function pointers.  Then,
  when that output plugin is selected, it'll automatically trigger
  loading the correct shared library if that's needed; and the shared
  library name may (but need not) match the output plugin name.
 
  I'd like to avoid relying on inserting stuff into pg_proc because that
  makes it harder to extract WAL from a HS standby. Requiring to configure
  that on the primary to extract data on the standby seems confusing to
  me.
 
  But perhaps that's the correct solution :/

 That's a reasonable concern.  I don't have another idea at the moment,
 unless we want to allow replication connections to issue LOAD
 commands.  Then you can LOAD the library, so that the plug-in is
 registered under the well-known name you expect it to have, and then
 use that name to start replication.

 But what's the advantage of that over the current situation or one where
 PG_load_output_plugin() is called? The current and related
  implementations only allow you to load libraries from some designated
  postgres directories, and they don't allow you to call arbitrary
 functions in there.

Well, I've already said why I don't like conflating the library name
and the plugin name.  It rules out core plugins and libraries that
provide multiple plugins.  I don't have anything to add to that.

 Would you be content with a symbol PG_load_output_plugin being called
 that fills out the actual callbacks?

Well, it doesn't fix the muddling of library names with output plugin
names, but I suppose I'd find it a modest improvement.

  Well, just providing the C API + an example in a first step didn't work
  out too badly for FDWs. I am pretty sure that once released there will
  soon be extensions for it on PGXN or whatever for special usecases.

 I suspect so, too.  But I also think that if that's the only thing
 available in the first release, a lot of users will get a poor initial
 impression.

 I think lots of people will expect a builtin logical replication
 solution :/. Which seems a tad unlikely to arrive in 9.4.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread k...@rice.edu
On Tue, Oct 15, 2013 at 11:02:39AM -0400, Robert Haas wrote:
  goals may be in conflict; we'll have to pick something.
 
  Note that parsing COPYs is a major PITA from most languages...
 
  Perhaps we should make the default output json instead? With every
  action terminated by a nullbyte?
  That's probably easier to parse from various scripting languages than
  anything else.
 
 I could go for that.  It's not quite as compact as I might hope, but
 JSON does seem to make people awfully happy.
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 

Feeding such a JSON stream into a compression algorithm like lz4 or
snappy should result in a pretty compact stream. The latest lz4 updates
also have the ability to use a pre-existing dictionary, which would really
help remove the redundant pieces.

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Andres Freund
On 2013-10-15 11:02:39 -0400, Robert Haas wrote:
  If the plugin interface isn't rich enough to provide a convenient way
  to avoid that, then it needs to be fixed so that it is, because it
  will be a common requirement.
 
  Oh, it surely is possible to avoid repeating it. The output plugin
  interface simply gives you a relcache entry that contains everything
  necessary.
  The output plugin would need to keep track of whether it has output data
  for a specific relation and it would need to check whether the table
  definition has changed, but I don't see how we could avoid that?
 
 Well, it might be nice if there were a callback for, hey, schema has
 changed!  Seems like a lot of plugins will want to know that for one
 reason or another, and rechecking for every tuple sounds expensive.

I don't really see how we could provide that in any useful manner. We
could provide a callback that is called whenever another transaction has
changed the schema, but there's nothing that can easily be done about schema
changes by the replayed transaction itself. And those are the only ones
where meaningful schema changes can happen since the locks the source
transaction has held will prevent most other schema changes.

As much as I hate such code, I guess checking (and possibly storing) the
ctid||xmin of the pg_class row is the easiest thing we could do :(.
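
A rough sketch of that idea; the RelSchemaState struct and the function are invented for illustration (the syscache/htup calls are ordinary backend infrastructure), and error handling is kept minimal. An output plugin could call something like this per relation before deciding whether to re-emit column names:

/* Rough sketch of the ctid||xmin check.  RelSchemaState is plugin-private
 * per-relation state invented here for illustration. */
#include "postgres.h"

#include "access/htup_details.h"
#include "storage/itemptr.h"
#include "utils/syscache.h"

typedef struct RelSchemaState
{
    ItemPointerData pg_class_tid;   /* t_self of the pg_class row last seen */
    TransactionId   pg_class_xmin;  /* its xmin */
} RelSchemaState;

static bool
schema_possibly_changed(Oid relid, RelSchemaState *state)
{
    HeapTuple   tup;
    bool        changed;

    tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for relation %u", relid);

    changed = !ItemPointerEquals(&tup->t_self, &state->pg_class_tid) ||
        HeapTupleHeaderGetXmin(tup->t_data) != state->pg_class_xmin;

    /* remember what we saw, so only the next change triggers re-output */
    state->pg_class_tid = tup->t_self;
    state->pg_class_xmin = HeapTupleHeaderGetXmin(tup->t_data);

    ReleaseSysCache(tup);

    return changed;
}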

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Robert Haas
On Tue, Oct 15, 2013 at 10:48 AM, Andres Freund and...@2ndquadrant.com wrote:
  What about columns like:
  * action B|I|U|D|C

 BEGIN and COMMIT?

 That's B and C, yes. You'd rather not have them? When would you replay
 the commit without an explicit message telling you to?

No, BEGIN and COMMIT sounds good, actually.  Just wanted to make sure
I understood.

 Repeating the column names for every row strikes me as a nonstarter.
 [...]
 Sure, some people may want JSON or XML
 output that reiterates the labels every time, but for a lot of people
 that's going to greatly increase the size of the output and be
 undesirable for that reason.

  But I argue that most of the simpler users - which are exactly the ones a
 generic output plugin is aimed at - will want all column names since it
 makes replay far easier.

Meh, maybe.

 If the plugin interface isn't rich enough to provide a convenient way
 to avoid that, then it needs to be fixed so that it is, because it
 will be a common requirement.

  Oh, it surely is possible to avoid repeating it. The output plugin
  interface simply gives you a relcache entry that contains everything
 necessary.
 The output plugin would need to keep track of whether it has output data
 for a specific relation and it would need to check whether the table
 definition has changed, but I don't see how we could avoid that?

Well, it might be nice if there were a callback for, hey, schema has
changed!  Seems like a lot of plugins will want to know that for one
reason or another, and rechecking for every tuple sounds expensive.

   What still needs to be determined is:
  * how do we separate and escape multiple values in one CSV column
  * how do we represent NULLs

 I consider the escaping a key design decision.  Ideally, it should be
 something that's easy to reverse from a scripting language; ideally
 also, it should be something similar to how we handle COPY.  These
 goals may be in conflict; we'll have to pick something.

 Note that parsing COPYs is a major PITA from most languages...

 Perhaps we should make the default output json instead? With every
 action terminated by a nullbyte?
 That's probably easier to parse from various scripting languages than
 anything else.

I could go for that.  It's not quite as compact as I might hope, but
JSON does seem to make people awfully happy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Josh Berkus
On 10/15/2013 07:56 AM, Andres Freund wrote:
   Well, just providing the C API + an example in a first step didn't work
   out too badly for FDWs. I am pretty sure that once released there will
   soon be extensions for it on PGXN or whatever for special usecases.
  
  I suspect so, too.  But I also think that if that's the only thing
  available in the first release, a lot of users will get a poor initial
  impression.
 I think lots of people will expect a builtin logical replication
 solution :/. Which seems a tad unlikely to arrive in 9.4.

Well, last I checked the Slony team is hard at work on building
something which will be based on logical changesets.  So there will
likely be at least one tool available shortly after 9.4 is released.

A good and flexible API is, IMHO, more important than having any
finished solution.  The whole reason why logical replication was outside
core PG for so long is that replication systems have differing and
mutually incompatible goals.  A good API can support all of those goals;
a user-level tool, no matter how good, can't.

And, frankly, once the API is built, how hard will it be to write a
script which does the simplest replication approach (replay all
statements on slave)?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread David Fetter
On Tue, Oct 15, 2013 at 10:09:05AM -0400, Robert Haas wrote:
 On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund and...@2ndquadrant.com wrote:
 User: So, what's new in PostgreSQL 9.4?
 Hacker: Well, now we have logical replication!
 User: Why is that cool?
 Hacker: Well, streaming replication is awesome for HA, but it has
 significant limitations.  And trigger-based systems are very mature,
 but the overhead is high and their lack of core integration makes them
 hard to use.  With this technology, you can build systems that will
 replicate individual tables or even parts of tables, multi-master
 systems, and lots of other cool stuff.
 User: Wow, that sounds great.  How do I use it?
 Hacker: Well, first you write an output plugin in C using a special API.
 User: Hey, do you know whether the MongoDB guys came to this conference?
 
 Let's try that again.
 
 User: Wow, that sounds great.  How do I use it?
 Hacker: Well, currently, the output gets dumped as a series of text
 files that are designed to be parsed using a scripting language.  We
 have sample parsers written in Perl and Python that you can use as-is
 or hack up to meet your needs.

My version:

Hacker: the output gets dumped as a series of JSON files.  We have
docs for this rev of the format and examples of consumers in Perl and
Python you can use as-is or hack up to meet your needs.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-15 Thread Peter Geoghegan
On Tue, Oct 15, 2013 at 7:09 AM, Robert Haas robertmh...@gmail.com wrote:
 Let's try that again.

 User: Wow, that sounds great.  How do I use it?
 Hacker: Well, currently, the output gets dumped as a series of text
 files that are designed to be parsed using a scripting language.  We
 have sample parsers written in Perl and Python that you can use as-is
 or hack up to meet your needs.

Have you heard of multicorn? Plugin authors can write a wrapper that
spits out JSON or whatever other thing they like, which can be
consumed by non C-hackers.

 Now, some users are still going to head for the hills.  But at least
 from where I sit it sounds a hell of a lot better than the first
 answer.  We're not going to solve all of the tooling problems around
 this technology in one release, for sure.  But as far as 95% of our
 users are concerned, a C API might as well not exist at all.  People
 WILL try to machine parse the output of whatever demo plugins we
 provide; so I think we should try hard to provide at least one such
 plugin that is designed to make that as easy as possible.

I agree that this is important, but I wouldn't like to weigh it too heavily.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-14 Thread Robert Haas
On Fri, Oct 11, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't see any need for SQL syntax.  I was just thinking that the
 _PG_init function could fill in a structure and then call
 RegisterLogicalReplicationOutputPlugin(mystruct).

 Hm. We can do that, but what'd be the advantage of that? The current
  model will correctly handle things like a 'shared_preload_libraries'ed
 output plugin, because its _PG_init() will not register it. With the
 handling done in _PG_init() there would be two.
 Being able to use the same .so for output plugin handling and some other
 replication solution specific stuff is imo useful.

Well, I just think relying on specific symbol names in the .so file is
kind of unfortunate.  It means that, for example, you can't have
multiple output plugins provided by a single .so.  And in general I
think it's something that we've tried to minimize.

 I don't see why you're so pessimistic about that.  I know you haven't
 worked it out yet, but what makes this harder than sitting down and
 designing something?

 Because every replication solution has different requirements for the
  format and they will want to filter the output stream with regard to their
 own configuration.
 E.g. bucardo will want to include the transaction timestamp for conflict
 resolution and such.

But there's only so much information available here.  Why not just
have a format that logs it all?

 Sure, that's no problem. Do I understand correctly that you'd like
  wal_decoding: Add information about a table's primary key to struct 
 RelationData
 wal_decoding: Add wal_level = logical and log data required for logical 
 decoding

 earlier?

Yes.

  I'd really like to do so. I am travelling atm, but I will be back
  tomorrow evening and will push an updated patch this weekend. The issue
  I know of in the latest patches at
  http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de
  is renaming from 
  http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de

  I'm a bit nervous about the way the combo CID logging works.  I would have
 thought that you would emit one record per combo CID, but what you're
 apparently doing is emitting one record per heap tuple that uses a
 combo CID.

 I thought and implemented that in the beginning. Unfortunately it's not
 enough :(. That's probably the issue that took me longest to understand
 in this patchseries...

  Combocids can only fix the case where a transaction actually has created
 a combocid:

 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = 
 Invalid)
 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)

  So, if we're decoding data that needs to look up those rows in TX1 or TX2,
  we need access to cmin and cmax both times, but neither transaction will
  have created a combocid. That can only be an issue in transactions with
  catalog modifications.

Oh, yuck.  So that means you have to write an extra WAL record for
EVERY heap insert, update, or delete to a catalog table?  OUCH.

  Couldn't measure anything either, which is not surprising given that I
 couldn't measure the overhead in the first place.

 I've done some parallel INSERT/DELETE pgbenching around the
 wal_level=logical and I couldn't measure any overhead with it
 disabled. With wal_level = logical, UPDATEs and DELETEs do get a bit
 slower, but that's to be expected.

 It'd probably not hurt to redo those benchmarks to make sure...

Yes, I think it would be good to characterize it more precisely than
"a bit", so people know what to expect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-14 Thread Andres Freund
On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
 On Fri, Oct 11, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I don't see any need for SQL syntax.  I was just thinking that the
  _PG_init function could fill in a structure and then call
  RegisterLogicalReplicationOutputPlugin(mystruct).
 
  Hm. We can do that, but what'd be the advantage of that? The current
   model will correctly handle things like a 'shared_preload_libraries'ed
  output plugin, because its _PG_init() will not register it. With the
  handling done in _PG_init() there would be two.
  Being able to use the same .so for output plugin handling and some other
  replication solution specific stuff is imo useful.
 
 Well, I just think relying on specific symbol names in the .so file is
 kind of unfortunate.  It means that, for example, you can't have
 multiple output plugins provided by a single .so.  And in general I
 think it's something that we've tried to minimize.

But that's not really different when you rely on _PG_init doing its
thing, right?

  I don't see why you're so pessimistic about that.  I know you haven't
  worked it out yet, but what makes this harder than sitting down and
  designing something?
 
  Because every replication solution has different requirements for the
   format and they will want to filter the output stream with regard to their
  own configuration.
  E.g. bucardo will want to include the transaction timestamp for conflict
  resolution and such.
 
 But there's only so much information available here.  Why not just
 have a format that logs it all?

Because we do not know what "all" is? Also, how would we generically handle
replication sets and such, which all of the existing replication solutions
have?

  Sure, that's no problem. Do I understand correctly that you'd like
   wal_decoding: Add information about a table's primary key to struct 
  RelationData
  wal_decoding: Add wal_level = logical and log data required for logical 
  decoding
 
  earlier?
 
 Yes.

That's done. Hope the new order makes sense.

   So, if we're decoding data that needs to look up those rows in TX1 or TX2,
   we need access to cmin and cmax both times, but neither transaction will
   have created a combocid. That can only be an issue in transactions with
   catalog modifications.

 Oh, yuck.  So that means you have to write an extra WAL record for
 EVERY heap insert, update, or delete to a catalog table?  OUCH.

Yes. We could integrate it into the main record without too many
problems, but it didn't seem like an important optimization and it would
have higher chances of slowing down wal_level < logical.
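
To make the discussion concrete, this is roughly the information such a per-catalog-tuple record has to carry; the layout below is hypothetical, for illustration only, and not the struct the patch actually emits:

/* Hypothetical layout, for illustration only.  One such record per catalog
 * heap insert/update/delete at wal_level = logical, so that decoding can
 * later resolve cmin/cmax for its catalog lookups even when no combocid
 * was generated by the source transaction. */
#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/relfilenode.h"

typedef struct xl_catalog_tuple_cid
{
    TransactionId   top_xid;    /* toplevel transaction doing the change */
    CommandId       cmin;
    CommandId       cmax;
    CommandId       combocid;   /* if one was used */
    RelFileNode     node;       /* which relation */
    ItemPointerData tid;        /* which tuple within it */
} xl_catalog_tuple_cid;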

  It'd probably not hurt to redo those benchmarks to make sure...
 
 Yes, I think it would be good to characterize it more precisely than
  "a bit", so people know what to expect.

"A bit" was below the 3% range for loops of adding columns.

So, any tests you'd like to see?
* loop around CREATE TABLE/DROP TABLE
* loop around ALTER TABLE ... ADD COLUMN
* loop around CREATE FUNCTION/DROP FUNCTION

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-14 Thread Andres Freund
On 2013-10-14 15:51:14 +0200, Andres Freund wrote:
   It'd probably not hurt to redo those benchmarks to make sure...
  
  Yes, I think it would be good to characterize it more precisely than
  "a bit", so people know what to expect.
 
 "A bit" was below the 3% range for loops of adding columns.
 
 So, any tests you'd like to see?
 * loop around CREATE TABLE/DROP TABLE
 * loop around ALTER TABLE ... ADD COLUMN
 * loop around CREATE FUNCTION/DROP FUNCTION

So, see the attached benchmark script. For each configuration I've done both
a disk bound and a memory bound (using eatmydata, preventing fsyncs) run.

* unpatched run, wal_level = hot_standby, eatmydata
* unpatched run, wal_level = hot_standby

* patched run, wal_level = hot_standby, eatmydata
* patched run, wal_level = hot_standby

* patched run, wal_level = logical, eatmydata
* patched run, wal_level = logical

Based on those results, there's no difference above noise for
wal_level=hot_standby, with or without the patch. With wal_level=logical
there's a measurable increase in wal traffic (~12-17%), but no
performance decrease above noise.

From my POV that's ok, those are really crazy catalog workloads.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
$ eatmydata /tmp/bench.sh
createtable single
tps = 320.293146 (including connections establishing)
tps = 320.334981 (excluding connections establishing)
262152  /tmp/perf/pg_xlog
createtable parallel
tps = 1516.095491 (including connections establishing)
tps = 1516.488664 (excluding connections establishing)
1163272 /tmp/perf/pg_xlog
altertable single
tps = 139.474141 (including connections establishing)
tps = 139.491721 (excluding connections establishing)
163848  /tmp/perf/pg_xlog
altertable parallel
tps = 774.868029 (including connections establishing)
tps = 775.136842 (excluding connections establishing)
851976  /tmp/perf/pg_xlog
createfunction single
tps = 2938.380801 (including connections establishing)
tps = 2938.711692 (excluding connections establishing)
81928   /tmp/perf/pg_xlog
createfunction parallel
tps = 20603.023567 (including connections establishing)
tps = 20608.799987 (excluding connections establishing)
458760  /tmp/perf/pg_xlog

$ /tmp/bench.sh
createtable single
tps = 51.014096 (including connections establishing)
tps = 51.020329 (excluding connections establishing)
49160   /tmp/perf/pg_xlog
createtable parallel
tps = 171.012045 (including connections establishing)
tps = 171.054406 (excluding connections establishing)
147464  /tmp/perf/pg_xlog
altertable single
tps = 9.138758 (including connections establishing)
tps = 9.139863 (excluding connections establishing)
32776   /tmp/perf/pg_xlog
altertable parallel
tps = 45.109269 (including connections establishing)
tps = 45.122066 (excluding connections establishing)
65544   /tmp/perf/pg_xlog
createfunction single
tps = 131.192719 (including connections establishing)
tps = 131.209112 (excluding connections establishing)
16392   /tmp/perf/pg_xlog
createfunction parallel
tps = 624.830173 (including connections establishing)
tps = 625.017525 (excluding connections establishing)
32776   /tmp/perf/pg_xlog

-- patch applied --
$ eatmydata /tmp/bench.sh
createtable single
tps = 329.063474 (including connections establishing)
tps = 329.104147 (excluding connections establishing)
262152  /tmp/perf/pg_xlog
createtable parallel
tps = 1462.524932 (including connections establishing)
tps = 1462.872552 (excluding connections establishing)
1130504 /tmp/perf/pg_xlog
altertable single
tps = 141.091905 (including connections establishing)
tps = 141.108352 (excluding connections establishing)
163848  /tmp/perf/pg_xlog
altertable parallel
tps = 812.810544 (including connections establishing)
tps = 813.026734 (excluding connections establishing)
901128  /tmp/perf/pg_xlog
createfunction single
tps = 3384.068190 (including connections establishing)
tps = 3384.460953 (excluding connections establishing)
81928   /tmp/perf/pg_xlog
createfunction parallel
tps = 20744.363972 (including connections establishing)
tps = 20750.135978 (excluding connections establishing)
458760  /tmp/perf/pg_xlog

$ /tmp/bench.sh
createtable single
tps = 42.522505 (including connections establishing)
tps = 42.527233 (excluding connections establishing)
49160   /tmp/perf/pg_xlog
createtable parallel
tps = 148.753762 (including connections establishing)
tps = 148.794361 (excluding connections establishing)
131080  /tmp/perf/pg_xlog
altertable single
tps = 9.524616 (including connections establishing)
tps = 9.525757 (excluding connections establishing)
32776   /tmp/perf/pg_xlog
altertable parallel
tps = 49.209278 (including connections establishing)
tps = 49.223312 (excluding connections establishing)
65544   /tmp/perf/pg_xlog
createfunction single
tps = 132.325526 (including connections establishing)
tps = 132.340677 (excluding connections establishing)
16392   /tmp/perf/pg_xlog
createfunction parallel

Re: [HACKERS] logical changeset generation v6.2

2013-10-11 Thread Robert Haas
On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi Robert,

 On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
 I spent some time looking at the sample plugin (patch 9/12).  Here are
 some review comments:

 - I think that the decoding plugin interface should work more like the
 foreign data wrapper interface.  Instead of using pg_dlsym to look up
 fixed names, I think there should be a struct of function pointers
 that gets filled in and registered somehow.

 You mean something like CREATE OUTPUT PLUGIN registering a function with
 an INTERNAL return value returning a filled struct? I thought about
 that, but it seemed more complex. Happy to change it though if it's
 preferred.

I don't see any need for SQL syntax.  I was just thinking that the
_PG_init function could fill in a structure and then call
RegisterLogicalReplicationOutputPlugin(mystruct).
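
For concreteness, a hypothetical sketch of that registration model; the struct, the register function, and the callbacks are invented names, not an existing interface:

/* Hypothetical sketch of _PG_init()-time registration.  All names below are
 * illustrative only. */
typedef struct LogicalOutputPlugin
{
    const char *name;           /* well-known name used to start replication */
    void        (*begin_cb) (void *private_data);
    void        (*change_cb) (void *private_data, const char *change);
    void        (*commit_cb) (void *private_data);
} LogicalOutputPlugin;

extern void RegisterLogicalReplicationOutputPlugin(const LogicalOutputPlugin *plugin);

static void json_begin(void *private_data) { /* emit BEGIN */ }
static void json_change(void *private_data, const char *change) { /* emit row */ }
static void json_commit(void *private_data) { /* emit COMMIT */ }

static const LogicalOutputPlugin json_plugin = {
    "json_decoder", json_begin, json_change, json_commit
};

void
_PG_init(void)
{
    /* a single .so could register several plugins under different names */
    RegisterLogicalReplicationOutputPlugin(&json_plugin);
}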

 - Still wondering how we'll use this from a bgworker.

 Simplified code to consume data:

Cool.  As long as that use case is supported, I'm happy; I just want
to make sure we're not presuming that there must be an external
client.

 - The output format doesn't look very machine-parseable.   I really
 think we ought to provide something that is.  Maybe a CSV-like format,
 or maybe something else, but I don't see why someone who wants to do
 change logging should be forced to write and install C code.  If
 something like Bucardo can run on an unmodified system and extract
 change-sets this way without needing a .so file, that's going to be a
 huge win for usability.

 We can change the current format but I really see little to no chance of
 agreeing on a replication format that's serviceable to several solutions
 short term. Once we've gained some experience - maybe even this cycle -
 that might be different.

I don't see why you're so pessimistic about that.  I know you haven't
worked it out yet, but what makes this harder than sitting down and
designing something?

 More generally on this patch set, if I'm going to be committing any of
 this, I'd prefer to start with what is currently patches 3 and 4, once
 we reach agreement on those.

 Sounds like a reasonable start.

Perhaps you could reshuffle the order of the series, if it's not too much work.

 Are we hoping to get any of this committed for this CF?  If so, let's
 make a plan to get that done; time is short.  If not, let's update the
 CF app accordingly.

 I'd really like to do so. I am travelling atm, but I will be back
 tomorrow evening and will push an updated patch this weekend. The issue
 I know of in the latest patches at
 http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de
 is renaming from 
 http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de

I'm a bit nervous about the way the combo CID logging works.  I would have
thought that you would emit one record per combo CID, but what you're
apparently doing is emitting one record per heap tuple that uses a
combo CID.  For some reason that feels like an abuse (and maybe kinda
inefficient, too).

Either way, I also wonder what happens if a (logical?) checkpoint
occurs between the combo CID record and the heap record to which it
refers, or how you prevent that from happening.  What if the combo CID
record is written and the transaction aborts before writing the heap
record (maybe without writing an abort record to WAL)?

What are the performance implications of this additional WAL logging?
What's the worst case?  What's the typical case?  Does it have a
noticeable overhead when wal_level < logical?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-11 Thread Andres Freund
Hi,

On 2013-10-11 09:08:43 -0400, Robert Haas wrote:
 On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
  I spent some time looking at the sample plugin (patch 9/12).  Here are
  some review comments:
 
  - I think that the decoding plugin interface should work more like the
  foreign data wrapper interface.  Instead of using pg_dlsym to look up
  fixed names, I think there should be a struct of function pointers
  that gets filled in and registered somehow.
 
  You mean something like CREATE OUTPUT PLUGIN registering a function with
  an INTERNAL return value returning a filled struct? I thought about
  that, but it seemed more complex. Happy to change it though if it's
  preferred.
 
 I don't see any need for SQL syntax.  I was just thinking that the
 _PG_init function could fill in a structure and then call
 RegisterLogicalReplicationOutputPlugin(mystruct).

Hm. We can do that, but what'd be the advantage of that? The current
model will correctly handle things like a 'shared_preload_libraries'ed
output plugin, because its _PG_init() will not register it. With the
handling done in _PG_init() there would be two.
Being able to use the same .so for output plugin handling and some other
replication solution specific stuff is imo useful.

  - Still wondering how we'll use this from a bgworker.
 
  Simplified code to consume data:
 
 Cool.  As long as that use case is supported, I'm happy; I just want
 to make sure we're not presuming that there must be an external
 client.

The included testcases are written using the SQL SRF interface, which in
turn is a usecase that doesn't use walsenders and such, so I hope we
won't break it accidentally ;)

  - The output format doesn't look very machine-parseable.   I really
  think we ought to provide something that is.  Maybe a CSV-like format,
  or maybe something else, but I don't see why someone who wants to do
  change logging should be forced to write and install C code.  If
  something like Bucardo can run on an unmodified system and extract
  change-sets this way without needing a .so file, that's going to be a
  huge win for usability.
 
  We can change the current format but I really see little to no chance of
  agreeing on a replication format that's serviceable to several solutions
  short term. Once we've gained some experience - maybe even this cycle -
  that might be different.
 
 I don't see why you're so pessimistic about that.  I know you haven't
 worked it out yet, but what makes this harder than sitting down and
 designing something?

Because every replication solution has different requirements for the
format and they will want to filter the output stream with regard to their
own configuration.
E.g. bucardo will want to include the transaction timestamp for conflict
resolution and such.

  More generally on this patch set, if I'm going to be committing any of
  this, I'd prefer to start with what is currently patches 3 and 4, once
  we reach agreement on those.
 
  Sounds like a reasonable start.
 
 Perhaps you could reshuffle the order of the series, if it's not too much 
 work.

Sure, that's no problem. Do I understand correctly that you'd like
wal_decoding: Add information about a table's primary key to struct RelationData
wal_decoding: Add wal_level = logical and log data required for logical decoding

earlier?

  I'd really like to do so. I am travelling atm, but I will be back
  tomorrow evening and will push an updated patch this weekend. The issue
  I know of in the latest patches at
  http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de
  is renaming from 
  http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de
 
 I'm a bit nervous about the way the combo CID logging works.  I would have
 thought that you would emit one record per combo CID, but what you're
 apparently doing is emitting one record per heap tuple that uses a
 combo CID.

I thought and implemented that in the beginning. Unfortunately it's not
enough :(. That's probably the issue that took me longest to understand
in this patchseries...

Combocids can only fix the case where a transaction actually has created
a combocid:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = 
Invalid)
2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)

So, if we're decoding data that needs to look up those rows in TX1 or TX2,
we need access to cmin and cmax both times, but neither transaction will
have created a combocid. That can only be an issue in transactions with
catalog modifications.

A slightly more complex variant also requires this if combocids are
involved:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = 
Invalid)
2) TX1: SAVEPOINT foo;
3) TX1-2: UPDATE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = 55, cmax = 56, 
combo=123)
new at 0/1: (xmin = 2, xmax=Invalid, cmin = 57, cmax = 

Re: [HACKERS] logical changeset generation v6.2

2013-10-10 Thread Andres Freund
Hi Robert,

On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
 I spent some time looking at the sample plugin (patch 9/12).  Here are
 some review comments:
 
 - I think that the decoding plugin interface should work more like the
 foreign data wrapper interface.  Instead of using pg_dlsym to look up
 fixed names, I think there should be a struct of function pointers
 that gets filled in and registered somehow.

You mean something like CREATE OUTPUT PLUGIN registering a function with
an INTERNAL return value returning a filled struct? I thought about
that, but it seemed more complex. Happy to change it though if it's
preferred.

 - pg_decode_init() only warns when it encounters an unknown option.
 An error seems more appropriate.

Fine with me. I think I just made it a warning because I wanted to
experiment with options.

 - Still wondering how we'll use this from a bgworker.

Simplified code to consume data:

/* grab the previously created decoding slot by name */
LogicalDecodingReAcquireSlot(NameStr(*name));

ctx = CreateLogicalDecodingContext(MyLogicalDecodingSlot,
                                   false /* not initial call */,
                                   MyLogicalDecodingSlot->confirmed_flush,
                                   options,
                                   logical_read_local_xlog_page,
                                   LogicalOutputPrepareWrite,
                                   LogicalOutputWrite);
...
while (true)
{
    XLogRecord *record;
    char       *errm = NULL;

    record = XLogReadRecord(ctx->reader, startptr, &errm);
    ...
    /* queue the record; the output plugin callbacks fire at commit */
    DecodeRecordIntoReorderBuffer(ctx, buf);
}

/* at the end, or better after every commit or such */
LogicalConfirmReceivedLocation(/* whatever you consumed */);

LogicalDecodingReleaseSlot();


 - The output format doesn't look very machine-parseable.   I really
 think we ought to provide something that is.  Maybe a CSV-like format,
 or maybe something else, but I don't see why someone who wants to do
 change logging should be forced to write and install C code.  If
 something like Bucardo can run on an unmodified system and extract
 change-sets this way without needing a .so file, that's going to be a
 huge win for usability.

We can change the current format but I really see little to no chance of
agreeing on a replication format that's serviceable to several solutions
short term. Once we've gained some experience - maybe even this cycle -
that might be different.

 More generally on this patch set, if I'm going to be committing any of
 this, I'd prefer to start with what is currently patches 3 and 4, once
 we reach agreement on those.

Sounds like a reasonable start.

 Are we hoping to get any of this committed for this CF?  If so, let's
 make a plan to get that done; time is short.  If not, let's update the
 CF app accordingly.

I'd really like to do so. I am travelling atm, but I will be back
tomorrow evening and will push an updated patch this weekend. The issue
I know of in the latest patches at
http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de
is renaming from 
http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de

Do you know of anything else in the patches you're referring to?

Thanks,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-09 Thread Robert Haas
On Mon, Sep 30, 2013 at 6:44 PM, Andres Freund and...@2ndquadrant.com wrote:
 The series from friday was a bit too buggy - obviously I was too
 tired. So here's a new one:

 * fix pg_recvlogical makefile (Thanks Steve)
 * fix two commits not compiling properly without later changes (Thanks Kevin)
 * keep track of commit timestamps
 * fix bugs with option passing in test_logical_decoding
 * actually parse option values in test_decoding instead of just using the
   option name
 * don't use anonymous structs in unions. That's compiler specific (msvc
   and gcc) before C11 on which we can't rely. That unfortunately will
   break output plugins because users of ReorderBufferChange need to qualify
   old/new tuples now
 * improve error handling/cleanup in test_logical_decoding
 * some minor cleanups

 Patches attached, git tree updated.

I spent some time looking at the sample plugin (patch 9/12).  Here are
some review comments:

- I think that the decoding plugin interface should work more like the
foreign data wrapper interface.  Instead of using pg_dlsym to look up
fixed names, I think there should be a struct of function pointers
that gets filled in and registered somehow.

- pg_decode_init() only warns when it encounters an unknown option.
An error seems more appropriate.

- Still wondering how we'll use this from a bgworker.

- The output format doesn't look very machine-parseable.   I really
think we ought to provide something that is.  Maybe a CSV-like format,
or maybe something else, but I don't see why someone who wants to do
change logging should be forced to write and install C code.  If
something like Bucardo can run on an unmodified system and extract
change-sets this way without needing a .so file, that's going to be a
huge win for usability.

Other than that, I don't have too many concerns about the plugin
interface.  I think it provides useful flexibility and it generally
seems well-designed.  I hope in the future we'll be able to decode
transactions on the fly instead of waiting until commit time, but I've
resigned myself to the fact that we may not get that in version one.

More generally on this patch set, if I'm going to be committing any of
this, I'd prefer to start with what is currently patches 3 and 4, once
we reach agreement on those.

Are we hoping to get any of this committed for this CF?  If so, let's
make a plan to get that done; time is short.  If not, let's update the
CF app accordingly.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-07 Thread Steve Singer

On 10/03/2013 04:00 PM, Andres Freund wrote:
Ok, there were a couple of bugs because I thought mxacts wouldn't need 
to be supported. So far your testcase doesn't crash the database 
anymore - it spews some internal errors though, so I am not sure if 
it's entirely fixed for you. Thanks for testing and helping! I've 
pushed the changes to the git tree, they aren't squashed yet and 
there's some further outstanding stuff, so I won't repost the series 
yet. Greetings, Andres Freund 
When I run your updated version (from friday, not what you posted today) 
against a more recent version of my slony changes I can get the test 
case to pass 2/3rds of the time.  The failures are due to an issue in 
slon itself that I need to fix.


I see lots of
0LOG:  tx with subtxn 58836

but they seem harmless.

Thanks



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-07 Thread Andres Freund
On 2013-10-07 09:56:11 -0400, Steve Singer wrote:
 On 10/03/2013 04:00 PM, Andres Freund wrote:
 Ok, there were a couple of bugs because I thought mxacts wouldn't need to
 be supported. So far your testcase doesn't crash the database anymore - it
 spews some internal errors though, so I am not sure if it's entirely fixed
 for you. Thanks for testing and helping! I've pushed the changes to the
 git tree, they aren't squashed yet and there's some further outstanding
 stuff, so I won't repost the series yet. Greetings, Andres Freund
 When I run your updated version (from friday, not what you posted today)
 against a more recent version of my slony changes I can get the test case to
 pass 2/3rds of the time.  The failures are due to an issue in slon itself
 that I need to fix.

Cool.

 I see lots of
 0LOG:  tx with subtxn 58836

Yes, those are completely harmless. And should, in fact, be removed. I
guess I should add the todo entry:
* make a pass over all elog/ereport and make sure they have the correct
  log level et al.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-03 Thread Andres Freund
On 2013-10-01 16:11:47 -0400, Steve Singer wrote:
 On 09/30/2013 06:44 PM, Andres Freund wrote:
 Hi,
 
 The series from friday was a bit too buggy - obviously I was too
 tired. So here's a new one:
 
 With this series I've also noticed
 #2  0x007741a7 in ExceptionalCondition (
 conditionName=conditionName@entry=0x7c2908 "!(!(tuple->t_infomask &
 0x1000))", errorType=errorType@entry=0x7acc70 "FailedAssertion",
 fileName=fileName@entry=0x91767e "tqual.c",
 lineNumber=lineNumber@entry=1608) at assert.c:54
 54              abort();
 
 
 #3  0x007a4432 in HeapTupleSatisfiesMVCCDuringDecoding (
 htup=0x10bfe48, snapshot=0x108b3d8, buffer=310) at tqual.c:1608
 #4  0x0049d6b7 in heap_hot_search_buffer (tid=tid@entry=0x10bfe4c,
 relation=0x7fbebbcd89c0, buffer=310, snapshot=0x10bfda0,
 heapTuple=heapTuple@entry=0x10bfe48,
 all_dead=all_dead@entry=0x7fff4aa3866f "\001\370\375\v\001",
 first_call=1 '\001') at heapam.c:1756
 #5  0x004a8174 in index_fetch_heap (scan=scan@entry=0x10bfdf8)
 at indexam.c:539
 #6  0x004a82a8 in index_getnext (scan=0x10bfdf8,
 direction=direction@entry=ForwardScanDirection) at indexam.c:622
 #7  0x004a6fa9 in systable_getnext (sysscan=sysscan@entry=0x10bfd48)
 at genam.c:343
 #8  0x0076df40 in RelidByRelfilenode (reltablespace=0,
 relfilenode=529775) at relfilenodemap.c:214
 ---Type <return> to continue, or q <return> to quit---
 #9  0x00664ad7 in ReorderBufferCommit (rb=0x1082d98,
 xid=<optimized out>, commit_lsn=4638756800, end_lsn=<optimized out>,
 commit_time=commit_time@entry=433970378426176) at reorderbuffer.c:1320

Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog
tables?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-03 Thread Steve Singer

On 10/03/2013 12:38 PM, Andres Freund wrote:
Does your code use SELECT FOR UPDATE/SHARE on system or 
treat_as_catalog tables? Greetings, Andres Freund 


Yes.
It declares sl_table and sl_sequence and sl_set as catalog.

It does a
SELECT ..
from @NAMESPACE@.sl_table T, @NAMESPACE@.sl_set S,
pg_catalog.pg_class PGC, pg_catalog.pg_namespace PGN,
pg_catalog.pg_index PGX, pg_catalog.pg_class PGXC
where ... for update

in the code being executed by the 'set add table'.

(We also do select for update commands in many other places during 
cluster configuration commands)




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-03 Thread Andres Freund
On 2013-10-03 13:03:07 -0400, Steve Singer wrote:
 On 10/03/2013 12:38 PM, Andres Freund wrote:
 Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog
 tables? Greetings, Andres Freund
 
 Yes.
 It declares sl_table and sl_sequence and sl_set as catalog.
 
 It does a
 SELECT ..
 from @NAMESPACE@.sl_table T, @NAMESPACE@.sl_set S,
 pg_catalog.pg_class PGC, pg_catalog.pg_namespace PGN,
 pg_catalog.pg_index PGX, pg_catalog.pg_class PGXC
 where ... for update
 
 in the code being executed by the 'set add table'.
 
 (We also do select for update commands in many other places during cluster
 configuration commands)

Ok, there were a couple of bugs because I thought mxacts wouldn't need
to be supported. So far your testcase doesn't crash the database
anymore - it spews some internal errors though, so I am not sure if it's
entirely fixed for you.

Thanks for testing and helping!

I've pushed the changes to the git tree, they aren't squashed yet and
there's some further outstanding stuff, so I won't repost the series yet.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-01 Thread Steve Singer

On 09/30/2013 06:44 PM, Andres Freund wrote:

Hi,

The series from friday was a bit too buggy - obviously I was too
tired. So here's a new one:


With this series I've also noticed
#2  0x007741a7 in ExceptionalCondition (
conditionName=conditionName@entry=0x7c2908 "!(!(tuple->t_infomask &
0x1000))", errorType=errorType@entry=0x7acc70 "FailedAssertion",
fileName=fileName@entry=0x91767e "tqual.c",
lineNumber=lineNumber@entry=1608) at assert.c:54
54              abort();


#3  0x007a4432 in HeapTupleSatisfiesMVCCDuringDecoding (
htup=0x10bfe48, snapshot=0x108b3d8, buffer=310) at tqual.c:1608
#4  0x0049d6b7 in heap_hot_search_buffer (tid=tid@entry=0x10bfe4c,
relation=0x7fbebbcd89c0, buffer=310, snapshot=0x10bfda0,
heapTuple=heapTuple@entry=0x10bfe48,
all_dead=all_dead@entry=0x7fff4aa3866f "\001\370\375\v\001",
first_call=1 '\001') at heapam.c:1756
#5  0x004a8174 in index_fetch_heap (scan=scan@entry=0x10bfdf8)
at indexam.c:539
#6  0x004a82a8 in index_getnext (scan=0x10bfdf8,
direction=direction@entry=ForwardScanDirection) at indexam.c:622
#7  0x004a6fa9 in systable_getnext (sysscan=sysscan@entry=0x10bfd48)
at genam.c:343
#8  0x0076df40 in RelidByRelfilenode (reltablespace=0,
relfilenode=529775) at relfilenodemap.c:214
---Type <return> to continue, or q <return> to quit---
#9  0x00664ad7 in ReorderBufferCommit (rb=0x1082d98,
xid=<optimized out>, commit_lsn=4638756800, end_lsn=<optimized out>,
commit_time=commit_time@entry=433970378426176) at reorderbuffer.c:1320

In addition to some of the other ones I've posted about.



* fix pg_recvlogical makefile (Thanks Steve)
* fix two commits not compiling properly without later changes (Thanks Kevin)
* keep track of commit timestamps
* fix bugs with option passing in test_logical_decoding
* actually parse option values in test_decoding instead of just using the
   option name
* don't use anonymous structs in unions. That's compiler specific (msvc
   and gcc) before C11 on which we can't rely. That unfortunately will
   break output plugins because users of ReorderBufferChange need to qualify
   old/new tuples now
* improve error handling/cleanup in test_logical_decoding
* some minor cleanups

Patches attached, git tree updated.

Greetings,

Andres Freund







--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers