Re: [HACKERS] logical changeset generation v6.2
On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
> > There's one snag I currently can see, namely that we actually need
> > to prevent that a formerly dropped relfilenode is getting reused.
> > Not entirely sure what the best way for that is.
>
> I'm not sure in detail, but it seems to me that this is all part of
> the same picture. If you're tracking changed relfilenodes, you'd
> better track dropped ones as well.

What I am thinking about is the way GetNewRelFileNode() checks for
preexisting relfilenodes. It uses SnapshotDirty to scan for existing
relfilenodes for a newly created oid, which means already dropped
relations could be reused. I guess it could be as simple as using
SatisfiesAny (or, even better, a wrapper around SatisfiesVacuum that
knows about recently dead tuples).

> Completely aside from this issue, what keeps a relation from being
> dropped before we've decoded all of the changes made to its data
> before the point at which it was dropped? (I hope the answer isn't
> "nothing".)

Nothing. But there's no need to prevent it: it'll still be in the
catalog, and we don't ever access a non-catalog relation's data during
decoding.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 29, 2013 at 10:47 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
> > > There's one snag I currently can see, namely that we actually
> > > need to prevent that a formerly dropped relfilenode is getting
> > > reused. Not entirely sure what the best way for that is.
> >
> > I'm not sure in detail, but it seems to me that this is all part of
> > the same picture. If you're tracking changed relfilenodes, you'd
> > better track dropped ones as well.
>
> What I am thinking about is the way GetNewRelFileNode() checks for
> preexisting relfilenodes. It uses SnapshotDirty to scan for existing
> relfilenodes for a newly created oid, which means already dropped
> relations could be reused. I guess it could be as simple as using
> SatisfiesAny (or, even better, a wrapper around SatisfiesVacuum that
> knows about recently dead tuples).

I think modifying GetNewRelFileNode() is attacking the problem from the
wrong end. The point is that when a table is dropped, that fact can be
communicated to the same machinery that's been tracking the CTID->CTID
mappings. Instead of saying "hey, the tuples that were in relfilenode
12345 are now in relfilenode 67890 in these new positions", it can say
"hey, the tuples that were in relfilenode 12345 are now GONE".

> > Completely aside from this issue, what keeps a relation from being
> > dropped before we've decoded all of the changes made to its data
> > before the point at which it was dropped? (I hope the answer isn't
> > "nothing".)
>
> Nothing. But there's no need to prevent it, it'll still be in the
> catalog and we don't ever access a non-catalog relation's data during
> decoding.

Oh, right. But what about a drop of a user-catalog table?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-29 11:28:44 -0400, Robert Haas wrote:
> On Tue, Oct 29, 2013 at 10:47 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > On 2013-10-28 11:54:31 -0400, Robert Haas wrote:
> > > > There's one snag I currently can see, namely that we actually
> > > > need to prevent that a formerly dropped relfilenode is getting
> > > > reused. Not entirely sure what the best way for that is.
> > >
> > > I'm not sure in detail, but it seems to me that this is all part
> > > of the same picture. If you're tracking changed relfilenodes,
> > > you'd better track dropped ones as well.
> >
> > What I am thinking about is the way GetNewRelFileNode() checks for
> > preexisting relfilenodes. It uses SnapshotDirty to scan for
> > existing relfilenodes for a newly created oid, which means already
> > dropped relations could be reused. I guess it could be as simple as
> > using SatisfiesAny (or, even better, a wrapper around
> > SatisfiesVacuum that knows about recently dead tuples).
>
> I think modifying GetNewRelFileNode() is attacking the problem from
> the wrong end. The point is that when a table is dropped, that fact
> can be communicated to the same machinery that's been tracking the
> CTID->CTID mappings. Instead of saying "hey, the tuples that were in
> relfilenode 12345 are now in relfilenode 67890 in these new
> positions", it can say "hey, the tuples that were in relfilenode
> 12345 are now GONE".

Unfortunately I don't understand what you're suggesting. What I am
worried about is something like:

- decoding is here
VACUUM FULL pg_class; -- rewrites filenode 1 to 2
VACUUM FULL pg_class; -- rewrites filenode 2 to 3
VACUUM FULL pg_class; -- rewrites filenode 3 to 1
- now decode up to here

In this case there are two possible (cmin,cmax) values for a specific
tuple: one from the original filenode 1 and one for the one generated
from 3. Now that will only happen if there's an oid wraparound, which
hopefully shouldn't happen very often, but I'd like to not rely on
that.

> > > Completely aside from this issue, what keeps a relation from
> > > being dropped before we've decoded all of the changes made to its
> > > data before the point at which it was dropped? (I hope the answer
> > > isn't "nothing".)
> >
> > Nothing. But there's no need to prevent it, it'll still be in the
> > catalog and we don't ever access a non-catalog relation's data
> > during decoding.
>
> Oh, right. But what about a drop of a user-catalog table?

Currently nothing prevents that. I am not sure it's worth worrying
about it; do you think we should?

Greetings,

Andres Freund
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 29, 2013 at 11:43 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > I think modifying GetNewRelFileNode() is attacking the problem from
> > the wrong end. The point is that when a table is dropped, that fact
> > can be communicated to the same machinery that's been tracking the
> > CTID->CTID mappings. Instead of saying "hey, the tuples that were
> > in relfilenode 12345 are now in relfilenode 67890 in these new
> > positions", it can say "hey, the tuples that were in relfilenode
> > 12345 are now GONE".
>
> Unfortunately I don't understand what you're suggesting. What I am
> worried about is something like:
>
> - decoding is here
> VACUUM FULL pg_class; -- rewrites filenode 1 to 2
> VACUUM FULL pg_class; -- rewrites filenode 2 to 3
> VACUUM FULL pg_class; -- rewrites filenode 3 to 1
> - now decode up to here
>
> In this case there are two possible (cmin,cmax) values for a specific
> tuple: one from the original filenode 1 and one for the one generated
> from 3. Now that will only happen if there's an oid wraparound, which
> hopefully shouldn't happen very often, but I'd like to not rely on
> that.

Ah, OK. I didn't properly understand the scenario you were concerned
about. There's only a potential problem here if we get behind by more
than 4 billion relfilenodes, which seems remote, but maybe not:

http://www.pgcon.org/2013/schedule/events/595.en.html

This still seems to me to be basically an accounting problem. At any
given time, we should *know* where the catalog tuples are located. We
can't be decoding changes that require a given system catalog while
that system catalog is locked, so any given decoding operation happens
either before or after, not during, the rewrite of the corresponding
catalog. As long as that VACUUM FULL operation is responsible for
updating the logical decoding metadata, we should be fine. Any relcache
entries referencing the old relfilenode need to be invalidated, and any
CTID->[cmin,cmax] maps we're storing for those old relfilenodes need to
be invalidated, too.

> > > > Completely aside from this issue, what keeps a relation from
> > > > being dropped before we've decoded all of the changes made to
> > > > its data before the point at which it was dropped? (I hope the
> > > > answer isn't "nothing".)
> > >
> > > Nothing. But there's no need to prevent it, it'll still be in the
> > > catalog and we don't ever access a non-catalog relation's data
> > > during decoding.
> >
> > Oh, right. But what about a drop of a user-catalog table?
>
> Currently nothing prevents that. I am not sure it's worth worrying
> about it; do you think we should?

Maybe. Depends partly on how ugly things get if it happens, I suppose.
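[Editorial note: the accounting Robert describes can be sketched as a toy
model. This is illustrative Python only, not PostgreSQL source; the class
and all names are hypothetical. The decoding side keeps one CTID ->
(cmin,cmax) map per relfilenode; a rewrite moves the surviving entries
under the new relfilenode and discards the map stored under the old one,
so a later reuse of that relfilenode number can never see stale entries,
and a drop simply throws the map away.]

```python
class DecodingMaps:
    """Toy model of decoding-side CTID -> (cmin, cmax) accounting."""

    def __init__(self):
        # {relfilenode: {ctid: (cmin, cmax)}}
        self.maps = {}

    def record(self, relfilenode, ctid, cmin, cmax):
        # A catalog modification logs visibility info for one tuple.
        self.maps.setdefault(relfilenode, {})[ctid] = (cmin, cmax)

    def rewrite(self, old_node, new_node, ctid_translation):
        # VACUUM FULL/CLUSTER: translate entries to the new relfilenode
        # and invalidate everything stored under the old one.
        old = self.maps.pop(old_node, {})
        self.maps[new_node] = {
            ctid_translation[ctid]: vals
            for ctid, vals in old.items()
            if ctid in ctid_translation  # tuples not copied simply vanish
        }

    def drop(self, relfilenode):
        # DROP TABLE: the tuples that were in this relfilenode are GONE.
        self.maps.pop(relfilenode, None)
```

Under this model, even if relfilenode numbers wrap around, the map keyed
by the reused number was already removed at rewrite or drop time, so the
two-possible-(cmin,cmax) ambiguity cannot arise.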
Re: [HACKERS] logical changeset generation v6.2
On Fri, Oct 25, 2013 at 7:57 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > However, I'm leery about the idea of using a relation fork for
> > this. I'm not sure whether that's what you had in mind, but it
> > gives me the willies. First, it adds distributed overhead to the
> > system, as previously discussed; and second, I think the accounting
> > may be kind of tricky, especially in the face of multiple rewrites.
> > I'd be more inclined to find a separate place to store the
> > mappings. Note that, AFAICS, there's no real need for the mapping
> > file to be block-structured, and I believe they'll be written first
> > (with no readers) and subsequently only read (with no further
> > writes) and eventually deleted.
>
> I was thinking of storing it along other data used during logical
> decoding and letting decoding's cleanup clean up that data as well.
> All the information for that should be there.

That seems OK.

> There's one snag I currently can see, namely that we actually need to
> prevent that a formerly dropped relfilenode is getting reused. Not
> entirely sure what the best way for that is.

I'm not sure in detail, but it seems to me that this is all part of the
same picture. If you're tracking changed relfilenodes, you'd better
track dropped ones as well.

Completely aside from this issue, what keeps a relation from being
dropped before we've decoded all of the changes made to its data before
the point at which it was dropped? (I hope the answer isn't "nothing".)

> > One possible objection to this is that it would preclude decoding
> > on a standby, which seems like a likely enough thing to want to do.
> > So maybe it's best to WAL-log the changes to the mapping file so
> > that the standby can reconstruct it if needed.
>
> The mapping file probably can be one big wal record, so it should be
> easy enough to do.

It might be better to batch it, because if you rewrite a big relation
and the record is really big, everyone else will be frozen out of
inserting WAL for as long as that colossal record is being written and
synced. If it's inserted in reasonably-sized chunks, the rest of the
system won't be starved as badly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
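[Editorial note: the batching Robert suggests can be sketched roughly.
This is illustrative Python with hypothetical names; real WAL insertion
works in terms of XLogInsert records, not Python lists. The point is
simply that the rewrite map gets split into bounded chunks, each of
which would become one modest WAL record rather than one colossal one.]

```python
def chunk_mappings(mappings, max_per_record=1000):
    """Split a list of (old_ctid, new_ctid) pairs into bounded chunks,
    each of which would be emitted as one WAL record, so concurrent
    backends can interleave their own WAL insertions between chunks."""
    for i in range(0, len(mappings), max_per_record):
        yield mappings[i:i + max_per_record]
```

A replaying standby (or a decoding reader) would simply accumulate the
chunks for one rewrite back into a single map.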
Re: [HACKERS] logical changeset generation v6.2
On Fri, Oct 25, 2013 at 8:14 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> So, I thought about this for some more and I think I've a partial
> solution to the problem.
>
> The worst thing about deadlocks that occur in the above is that they
> could be the VACUUM FULL waiting for the restart LSN[1] of a decoding
> slot to progress, while the restart LSN cannot progress because the
> slot is waiting for a xid/transaction to end which is being blocked
> by the lock upgrade from VACUUM FULL. Such conflicts are not visible
> to the deadlock detector, which obviously is bad. I've prototyped
> this (~25 lines) and it happens pretty frequently.
>
> But it turns out that we can actually fix this by exporting (to
> shared memory) the oldest in-progress xid of a decoding slot. Then
> the waiting code can do a XactLockTableWait() for that xid...
>
> I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
> that's already the case for VACUUM FULLs of system tables, although
> less likely. And it will be detected/handled.
>
> There's one more snag though: we currently allow "CLUSTER
> system_table;" in an existing transaction. I think that'd have to be
> disallowed.

It wouldn't bother me too much to restrict CLUSTER system_table by
PreventTransactionChain() at wal_level = logical, but obviously it
would be nicer if we *didn't* have to do that.

In general, I don't think waiting on an XID is sufficient, because a
process can acquire a heavyweight lock without having an XID. Perhaps
use the VXID instead?

One thought I had about waiting for decoding to catch up is that you
might do it before acquiring the lock. Of course, you then have a
problem if you get behind again before acquiring the lock. It's
tempting to adopt the solution we used for RangeVarGetRelidExtended,
namely: wait for catchup without the lock, acquire the lock, see
whether we're still caught up; if so, great, else release the lock and
loop. But there's probably too much starvation risk to get away with
that.

On the whole, I'm leaning toward thinking that the other solution
(recording the old-to-new CTID mappings generated by CLUSTER to the
extent that they are needed) is probably more elegant.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
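[Editorial note: the RangeVarGetRelidExtended-style retry pattern
described above can be sketched as a toy model. This is illustrative
Python; the callback arguments are hypothetical stand-ins, not
PostgreSQL APIs. As Robert notes, a steady stream of writers can starve
this loop indefinitely, which is the objection to it.]

```python
def lock_when_caught_up(wait_for_catchup, acquire_lock, release_lock,
                        caught_up):
    """Wait for decoding to catch up *without* the lock (so decoding can
    still read the catalogs), then take the lock and re-check; if new
    changes snuck in before the lock was acquired, release and retry."""
    attempts = 0
    while True:
        attempts += 1
        wait_for_catchup()   # lock not held: decoding can make progress
        acquire_lock()       # now block new writers
        if caught_up():      # still caught up? safe to proceed
            return attempts
        release_lock()       # fell behind again; loop (starvation risk)
```

The starvation risk is visible in the model: nothing bounds the number
of iterations if writers keep arriving between the wait and the lock.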
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-28 12:04:01 -0400, Robert Haas wrote:
> On Fri, Oct 25, 2013 at 8:14 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
> > that's already the case for VACUUM FULLs of system tables, although
> > less likely. And it will be detected/handled.
> >
> > There's one more snag though: we currently allow "CLUSTER
> > system_table;" in an existing transaction. I think that'd have to
> > be disallowed.
>
> It wouldn't bother me too much to restrict CLUSTER system_table by
> PreventTransactionChain() at wal_level = logical, but obviously it
> would be nicer if we *didn't* have to do that.
>
> In general, I don't think waiting on an XID is sufficient, because a
> process can acquire a heavyweight lock without having an XID. Perhaps
> use the VXID instead?

But decoding doesn't care about transactions that haven't used an XID
yet (since that means they haven't modified the catalog), so that
shouldn't be problematic.

> One thought I had about waiting for decoding to catch up is that you
> might do it before acquiring the lock. Of course, you then have a
> problem if you get behind again before acquiring the lock. It's
> tempting to adopt the solution we used for RangeVarGetRelidExtended,
> namely: wait for catchup without the lock, acquire the lock, see
> whether we're still caught up; if so, great, else release the lock
> and loop. But there's probably too much starvation risk to get away
> with that.

I think we'd pretty much always starve in that case. It'd be different
if we could detect that there weren't any writes to the table in
between. I can see doing that using a locking hack like autovac uses,
but brr, that'd be ugly.

> On the whole, I'm leaning toward thinking that the other solution
> (recording the old-to-new CTID mappings generated by CLUSTER to the
> extent that they are needed) is probably more elegant.

I personally still think that the wide cmin/cmax solution is *much*
more elegant, simpler, and usable for other things than logical
decoding. Since you don't seem to agree, I am going to write a
prototype using such a mapping to see how it will look, though.

Greetings,

Andres Freund
Re: [HACKERS] logical changeset generation v6.2
On Mon, Oct 28, 2013 at 12:17 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > In general, I don't think waiting on an XID is sufficient, because
> > a process can acquire a heavyweight lock without having an XID.
> > Perhaps use the VXID instead?
>
> But decoding doesn't care about transactions that haven't used an XID
> yet (since that means they haven't modified the catalog), so that
> shouldn't be problematic.

Hmm, maybe. But what if the deadlock has more members? E.g. A is
blocking decoding by holding an AEL with no XID, B is blocking A by
doing a VACUUM FULL on a rel A needs, and decoding is blocking B.
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-24 10:59:21 -0400, Robert Haas wrote:
> On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
> > > On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > > > > That strikes me as a flaw in the implementation rather than
> > > > > the idea. You're presupposing a patch where the necessary
> > > > > information is available in WAL yet you don't make use of it
> > > > > at the proper time.
> > > >
> > > > The problem is that the mapping would be somewhere *ahead* from
> > > > the transaction/WAL we're currently decoding. We'd need to read
> > > > ahead till we find the correct one.
> > >
> > > Yes, I think that's what you need to do.
> >
> > My problem with that is that rewrite can be gigabytes into the
> > future.
> >
> > When reading forward we could either just continue reading data
> > into the reorderbuffer, but delay replaying all future commits till
> > we found the currently needed remap. That might have quite the
> > additional storage/memory cost, but runtime complexity should be
> > the same as normal decoding.
> >
> > Or we could individually read ahead for every transaction. But
> > doing so for every transaction will get rather expensive (roughly
> > O(amount_of_wal^2)).
>
> [ Sorry it's taken me a bit of time to get back to this; other tasks
> intervened, and I also just needed some time to let it settle in my
> brain. ]

No worries. I've had enough things to work on ;)

> If you read ahead looking for a set of ctid translations from
> relfilenode A to relfilenode B, and along the way you happen to
> encounter a set of translations from relfilenode C to relfilenode D,
> you could stash that set of translations away somewhere, so that if
> the next transaction you process needs that set of mappings, it's
> already computed. With that approach, you'd never have to pre-read
> the same set of WAL files more than once. But, as I think about it
> more, that's not very different from your idea of stashing the
> translations someplace other than WAL in the first place. I mean, if
> the read-ahead thread generates a series of files in
> pg_somethingorother that contain those maps, you could have just
> written the maps to that directory in the first place. So on further
> review I think we could adopt that approach.

Yea, that basically was my reasoning, only expressed much more
nicely ;)

> However, I'm leery about the idea of using a relation fork for this.
> I'm not sure whether that's what you had in mind, but it gives me the
> willies. First, it adds distributed overhead to the system, as
> previously discussed; and second, I think the accounting may be kind
> of tricky, especially in the face of multiple rewrites. I'd be more
> inclined to find a separate place to store the mappings. Note that,
> AFAICS, there's no real need for the mapping file to be
> block-structured, and I believe they'll be written first (with no
> readers) and subsequently only read (with no further writes) and
> eventually deleted.

I was thinking of storing it along other data used during logical
decoding and letting decoding's cleanup clean up that data as well.
All the information for that should be there.

There's one snag I currently can see, namely that we actually need to
prevent that a formerly dropped relfilenode is getting reused. Not
entirely sure what the best way for that is.

> One possible objection to this is that it would preclude decoding on
> a standby, which seems like a likely enough thing to want to do. So
> maybe it's best to WAL-log the changes to the mapping file so that
> the standby can reconstruct it if needed.

The mapping file probably can be one big wal record, so it should be
easy enough to do.

For a moment I thought there's a problem with decoding on the standby
having to read ahead of the current location to find the newer mapping,
but that's actually not required since we're protected by the AEL lock
during rewrites on the standby as well.

> > I think that'd be pretty similar to just disallowing VACUUM
> > FREEZE/CLUSTER on catalog relations, since effectively it'd be too
> > expensive to use.
>
> This seems unduly pessimistic to me; unless the catalogs are really
> darn big, this is a mostly theoretical problem.

Well, it's not the size of the relation, but the amount of concurrent
WAL that's being generated that matters. But anyway, if we do it like
you described above, that shouldn't be a problem.

Greetings,

Andres Freund
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-21 16:15:58 +0200, Andres Freund wrote:
> > I don't think I understand exactly what you have in mind for (2);
> > can you elaborate? I have always thought that having a
> > WaitForDecodingToCatchUp() primitive was a good way of handling
> > changes that were otherwise too difficult to track our way through.
> > I am not sure you're doing that at all right now, which in some
> > sense I guess is fine, but I haven't really understood your
> > aversion to this solution. There are some locking issues to be
> > worked out here, but the problems don't seem altogether
> > intractable.
>
> So, what we need to do for rewriting catalog tables would be:
> 1) lock table against writes
> 2) wait for all in-progress xacts to finish, since they could have
>    modified the table in question (we don't keep locks on system
>    tables)
> 3) acquire the xlog insert pointer
> 4) wait for all logical decoding actions to read past that pointer
> 5) upgrade the lock to an access exclusive one
> 6) perform the vacuum full as usual
>
> The lock upgrade hazards in here are the reason I am averse to the
> solution. And I don't see how we can avoid them, since in order for
> decoding to catch up it has to be able to read from the catalog...
> Otherwise it's easy enough to implement.

So, I thought about this for some more and I think I've a partial
solution to the problem.

The worst thing about deadlocks that occur in the above is that they
could be the VACUUM FULL waiting for the restart LSN[1] of a decoding
slot to progress, while the restart LSN cannot progress because the
slot is waiting for a xid/transaction to end which is being blocked by
the lock upgrade from VACUUM FULL. Such conflicts are not visible to
the deadlock detector, which obviously is bad. I've prototyped this
(~25 lines) and it happens pretty frequently.

But it turns out that we can actually fix this by exporting (to shared
memory) the oldest in-progress xid of a decoding slot. Then the waiting
code can do a XactLockTableWait() for that xid...

I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
that's already the case for VACUUM FULLs of system tables, although
less likely. And it will be detected/handled.

There's one more snag though: we currently allow "CLUSTER
system_table;" in an existing transaction. I think that'd have to be
disallowed.

What do you think?

Greetings,

Andres Freund

[1] The restart LSN is the point from where we need to be able to read
WAL to replay all changes the receiving side hasn't acked yet.
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
> > On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > > > That strikes me as a flaw in the implementation rather than the
> > > > idea. You're presupposing a patch where the necessary
> > > > information is available in WAL yet you don't make use of it at
> > > > the proper time.
> > >
> > > The problem is that the mapping would be somewhere *ahead* from
> > > the transaction/WAL we're currently decoding. We'd need to read
> > > ahead till we find the correct one.
> >
> > Yes, I think that's what you need to do.
>
> My problem with that is that rewrite can be gigabytes into the
> future.
>
> When reading forward we could either just continue reading data into
> the reorderbuffer, but delay replaying all future commits till we
> found the currently needed remap. That might have quite the
> additional storage/memory cost, but runtime complexity should be the
> same as normal decoding.
>
> Or we could individually read ahead for every transaction. But doing
> so for every transaction will get rather expensive (roughly
> O(amount_of_wal^2)).

[ Sorry it's taken me a bit of time to get back to this; other tasks
intervened, and I also just needed some time to let it settle in my
brain. ]

If you read ahead looking for a set of ctid translations from
relfilenode A to relfilenode B, and along the way you happen to
encounter a set of translations from relfilenode C to relfilenode D,
you could stash that set of translations away somewhere, so that if the
next transaction you process needs that set of mappings, it's already
computed. With that approach, you'd never have to pre-read the same set
of WAL files more than once.

But, as I think about it more, that's not very different from your idea
of stashing the translations someplace other than WAL in the first
place. I mean, if the read-ahead thread generates a series of files in
pg_somethingorother that contain those maps, you could have just
written the maps to that directory in the first place. So on further
review I think we could adopt that approach.

However, I'm leery about the idea of using a relation fork for this.
I'm not sure whether that's what you had in mind, but it gives me the
willies. First, it adds distributed overhead to the system, as
previously discussed; and second, I think the accounting may be kind of
tricky, especially in the face of multiple rewrites. I'd be more
inclined to find a separate place to store the mappings. Note that,
AFAICS, there's no real need for the mapping file to be
block-structured, and I believe they'll be written first (with no
readers) and subsequently only read (with no further writes) and
eventually deleted.

One possible objection to this is that it would preclude decoding on a
standby, which seems like a likely enough thing to want to do. So maybe
it's best to WAL-log the changes to the mapping file so that the
standby can reconstruct it if needed.

> I think that'd be pretty similar to just disallowing VACUUM
> FREEZE/CLUSTER on catalog relations, since effectively it'd be too
> expensive to use.

This seems unduly pessimistic to me; unless the catalogs are really
darn big, this is a mostly theoretical problem.
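[Editorial note: the stash-as-you-scan idea above can be sketched as a
toy model. This is illustrative Python with a hypothetical record
format; the real translations would live in WAL records or in files
under a dedicated directory, as discussed.]

```python
def find_map(wal_stream, wanted_old_node, cache):
    """Return the ctid translation map for wanted_old_node, consuming
    wal_stream only as far as needed.  Every map encountered along the
    way is stashed in cache, so a later transaction needing a different
    map never forces the same stretch of WAL to be read again."""
    if wanted_old_node in cache:
        return cache[wanted_old_node]
    for rec in wal_stream:
        if rec["type"] == "ctid_map":
            cache[rec["old_node"]] = rec["map"]
            if rec["old_node"] == wanted_old_node:
                return rec["map"]
    return None  # no rewrite of that relfilenode found ahead
```

The cache is exactly the set of files a read-ahead pass would have
written out, which is why the two designs converge.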
Re: [HACKERS] logical changeset generation v6.2
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> So. As it turns out that solution isn't sufficient in the face of
> VACUUM FULL and mixed DML/DDL transactions that have not yet been
> decoded.
>
> To reiterate, as published it works like this: for every modification
> of a catalog tuple (insert, multi_insert, update, delete) that has
> influence over visibility, issue a record that contains:
> * filenode
> * ctid
> * (cmin, cmax)
>
> When doing a visibility check on a catalog row during decoding of a
> mixed DML/DDL transaction, look up (cmin, cmax) for that row, since
> we don't store both for the tuple.
>
> That mostly works great.
>
> The problematic scenario is decoding a transaction that has done
> mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The
> VACUUM FULL obviously changes the filenode and the ctid of a tuple,
> so we cannot successfully do a lookup based on what we logged before.

So I have a new idea for handling this problem, which seems obvious in
retrospect: what if we make the VACUUM FULL or CLUSTER log the old
CTID -> new CTID mappings? This would only need to be done for catalog
tables, and maybe could be skipped for tuples whose XIDs are old enough
that we know those transactions must already be decoded.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
> On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > So. As it turns out that solution isn't sufficient in the face of
> > VACUUM FULL and mixed DML/DDL transactions that have not yet been
> > decoded.
> >
> > To reiterate, as published it works like this: for every
> > modification of a catalog tuple (insert, multi_insert, update,
> > delete) that has influence over visibility, issue a record that
> > contains:
> > * filenode
> > * ctid
> > * (cmin, cmax)
> >
> > When doing a visibility check on a catalog row during decoding of a
> > mixed DML/DDL transaction, look up (cmin, cmax) for that row, since
> > we don't store both for the tuple.
> >
> > That mostly works great.
> >
> > The problematic scenario is decoding a transaction that has done
> > mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The
> > VACUUM FULL obviously changes the filenode and the ctid of a tuple,
> > so we cannot successfully do a lookup based on what we logged
> > before.
>
> So I have a new idea for handling this problem, which seems obvious
> in retrospect: what if we make the VACUUM FULL or CLUSTER log the old
> CTID -> new CTID mappings? This would only need to be done for
> catalog tables, and maybe could be skipped for tuples whose XIDs are
> old enough that we know those transactions must already be decoded.

Ah. If only it were so simple ;). That was my first idea, and after I'd
bragged in a 2ndq internal chat that I'd found a simple idea, I
obviously had to realize it doesn't work. Consider:

INIT_LOGICAL_REPLICATION;
CREATE TABLE foo(...);
BEGIN;
INSERT INTO foo;
ALTER TABLE foo ...;
INSERT INTO foo;
COMMIT TX 3;
VACUUM FULL pg_class;
START_LOGICAL_REPLICATION;

When we decode tx 3 we haven't yet read the mapping from the VACUUM
FULL. That scenario can happen either because decoding was stopped for
a moment, or because decoding couldn't keep up (slow connection,
whatever).

There also can be nasty variations where the VACUUM FULL happens while
a transaction is writing data, since we don't hold locks on system
relations for very long.

Greetings,

Andres Freund
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 11:02 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-22 10:52:48 -0400, Robert Haas wrote: On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote: So. As it turns out that solution isn't sufficient in the face of VACUUM FULL and mixed DML/DDL transaction that have not yet been decoded. To reiterate, as published it works like: For every modification of catalog tuple (insert, multi_insert, update, delete) that has influence over visibility issue a record that contains: * filenode * ctid * (cmin, cmax) When doing a visibility check on a catalog row during decoding of mixed DML/DDL transaction lookup (cmin, cmax) for that row since we don't store both for the tuple. That mostly works great. The problematic scenario is decoding a transaction that has done mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM FULL obviously changes the filenode and the ctid of a tuple, so we cannot successfully do a lookup based on what we logged before. So I have a new idea for handling this problem, which seems obvious in retrospect. What if we make the VACUUM FULL or CLUSTER log the old CTID - new CTID mappings? This would only need to be done for catalog tables, and maybe could be skipped for tuples whose XIDs are old enough that we know those transactions must already be decoded. Ah. If it only were so simple ;). That was my first idea, and after I'd bragged in an 2ndq internal chat that I'd found a simple idea I obviously had to realize it doesn't work. Consider: INIT_LOGICAL_REPLICATION; CREATE TABLE foo(...); BEGIN; INSERT INTO foo; ALTER TABLE foo ...; INSERT INTO foo; COMMIT TX 3; VACUUM FULL pg_class; START_LOGICAL_REPLICATION; When we decode tx 3 we haven't yet read the mapping from the vacuum freeze. That scenario can happen either because decoding was stopped for a moment, or because decoding couldn't keep up (slow connection, whatever). 
That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. It seems to me that you have to think of the CTID map as tied to a relfilenode; if you try to use one relfilenode's map with a different relfilenode, it's obviously not going to work. So don't do that. /me looks innocent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
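[Editor's note: for readers following along, the per-modification logging scheme described above can be sketched in miniature. This is an illustrative model only, with hypothetical names and flat integer types standing in for PostgreSQL's RelFileNode/ItemPointer/CommandId; the point is that the lookup key includes the relfilenode, so a VACUUM FULL (new relfilenode, new ctids) makes a previously logged key unresolvable.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the logged record: enough to recover (cmin, cmax)
 * for one catalog tuple version during decoding. Hypothetical names. */
typedef struct LoggedTupleCid
{
    uint32_t relfilenode;   /* relation's relfilenode at logging time */
    uint32_t ctid_block;    /* tuple position at logging time */
    uint16_t ctid_offset;
    uint32_t cmin;
    uint32_t cmax;
} LoggedTupleCid;

/* Linear scan over the records collected for one transaction; a real
 * implementation would use a hash table keyed by (node, ctid). */
static const LoggedTupleCid *
lookup_cids(const LoggedTupleCid *recs, int nrecs,
            uint32_t node, uint32_t block, uint16_t off)
{
    for (int i = 0; i < nrecs; i++)
        if (recs[i].relfilenode == node &&
            recs[i].ctid_block == block &&
            recs[i].ctid_offset == off)
            return &recs[i];
    return NULL;    /* after a rewrite, node/ctid changed: lookup fails */
}
```

A lookup with the rewritten relation's new relfilenode simply finds nothing, which is the failure mode the thread is discussing.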
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Greetings, Andres Freund
Re: [HACKERS] logical changeset generation v6.2
On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. - Heikki
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. The existing proposals were mostly about just parallelizing the sort and similar operations, right? In such scenarios you really need it only for the catalog. But we could easily generalize it for user data too. We should even be able to only use wide cids when a backend needs it, since inherently it's only needed within a single transaction. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 22.10.2013 19:23, Andres Freund wrote: On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. The existing proposals were mostly about just parallelizing the sort and similar operations, right? In such scenarios you really need it only for the catalog. But we could easily generalize it for user data too. We should even be able to only use wide cids when a backend needs it, since inherently it's only needed within a single transaction. Or just hand over a copy of the combocid map to the worker, along with the snapshot. Seems a lot simpler than this wide cids business.. - Heikki
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 12:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Or just hand over a copy of the combocid map to the worker, along with the snapshot. Seems a lot simpler than this wide cids business.. Yes, that's what Noah and I talked about doing. Or possibly even making the map into a hash table in dynamic shared memory, so that new combo CIDs could be allocated by any backend in the parallel group. But that seems hard, so for starters I think we'll only parallelize read-only operations and just hand over a copy of the map. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
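[Editor's note: the hand-over-a-copy idea is simple enough to sketch. A minimal toy model, assuming (as combocid.c's private state suggests) that the leader's combo-CID map is a flat array indexed by combo CID; all names here are illustrative, not PostgreSQL's actual API.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model: combo CID -> (cmin, cmax). Hypothetical struct name. */
typedef struct ComboCidPair
{
    uint32_t cmin;
    uint32_t cmax;
} ComboCidPair;

/* Leader side: flat-copy the map into memory shared with the worker
 * at snapshot time. Read-only workers never extend it, so a memcpy
 * of the array is sufficient. */
static void
share_combocid_map(const ComboCidPair *local, int ncombocids,
                   ComboCidPair *shared)
{
    memcpy(shared, local, ncombocids * sizeof(ComboCidPair));
}

/* Worker side: resolve a combo CID found in a tuple header back into
 * the real cmin/cmax pair. */
static ComboCidPair
resolve_combocid(const ComboCidPair *shared, uint32_t combocid)
{
    return shared[combocid];
}
```

Allocating *new* combo CIDs from multiple backends, as mentioned above, is the hard part this sketch deliberately omits: the copy is one-way and valid only for reading.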
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 19:25:31 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:23, Andres Freund wrote: On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote: On 22.10.2013 19:12, Andres Freund wrote: On 2013-10-18 20:26:16 +0200, Andres Freund wrote: 4) Store both (cmin, cmax) for catalog tuples. BTW: That would have the nice side-effect of delivering the basis of what you need to do parallel sort in a transaction that previously has performed DDL. Currently you cannot do anything in parallel after DDL, even if you only scan the table in one backend, because operators et al. have to do catalog lookups which you can't do consistently since cmin/cmax aren't available in both. Parallel workers will need cmin/cmax for user tables too, to know which tuples are visible to the snapshot. The existing proposals were mostly about just parallelizing the sort and similar operations, right? In such scenarios you really need it only for the catalog. But we could easily generalize it for user data too. We should even be able to only use wide cids when a backend needs it, since inherently it's only needed within a single transaction. Or just hand over a copy of the combocid map to the worker, along with the snapshot. Seems a lot simpler than this wide cids business.. That's not sufficient if you want to continue writing in the primary backend though, which isn't an uninteresting thing. I am not saying that parallel XXX is a sufficient reason for this, just that it might be a co-benefactor. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 11:59:35 -0400, Robert Haas wrote: So I have a new idea for handling this problem, which seems obvious in retrospect. What if we make the VACUUM FULL or CLUSTER log the old CTID -> new CTID mappings? This would only need to be done for catalog tables, and maybe could be skipped for tuples whose XIDs are old enough that we know those transactions must already be decoded. Ah. If it only were so simple ;). That was my first idea, and after I'd bragged in a 2ndQ internal chat that I'd found a simple idea I obviously had to realize it doesn't work. Consider: INIT_LOGICAL_REPLICATION; CREATE TABLE foo(...); BEGIN; INSERT INTO foo; ALTER TABLE foo ...; INSERT INTO foo; COMMIT TX 3; VACUUM FULL pg_class; START_LOGICAL_REPLICATION; When we decode tx 3 we haven't yet read the mapping from the vacuum freeze. That scenario can happen either because decoding was stopped for a moment, or because decoding couldn't keep up (slow connection, whatever). It seems to me that you have to think of the CTID map as tied to a relfilenode; if you try to use one relfilenode's map with a different relfilenode, it's obviously not going to work. So don't do that. It has to be tied to relfilenode (+ctid) *and* transaction unfortunately. That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. The problem is that the mapping would be somewhere *ahead* of the transaction/WAL we're currently decoding. We'd need to read ahead till we find the correct one. But I think I mainly misunderstood what you proposed. That mapping could be written beside the relfilenode, instead of into the WAL. Then my imagined problem doesn't exist anymore. We only would need to write out mappings for tuples modified since the xmin horizon, so it wouldn't even be *too* bad for bigger relations. 
This won't easily work for two+ rewrites because we'd need to apply all mappings in order and thus would have to keep a history of intermediate nodes/mappings. But it'd be perfectly doable to simply wait till decoders are caught up. I still feel that simply storing both cmin, cmax is cleaner, but if that's not acceptable, I can certainly live with something like this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
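[Editor's note: the two-rewrite problem above can be made concrete with a toy model. Each rewrite's map only translates one step, so resolving a ctid logged before the first rewrite requires composing the maps, i.e. retaining the intermediate mapping until all decoders have passed it. Names are hypothetical, with a bare uint32 standing in for a ctid.]

```c
#include <assert.h>
#include <stdint.h>

/* One entry of a rewrite map: where a tuple moved during one
 * VACUUM FULL/CLUSTER of the relation. Hypothetical names. */
typedef struct CtidRemap
{
    uint32_t old_ctid;
    uint32_t new_ctid;
} CtidRemap;

/* Translate a ctid through a single rewrite's map; returns 0 if the
 * ctid is unknown to this map. */
static int
remap_one(const CtidRemap *map, int n, uint32_t ctid, uint32_t *out)
{
    for (int i = 0; i < n; i++)
    {
        if (map[i].old_ctid == ctid)
        {
            *out = map[i].new_ctid;
            return 1;
        }
    }
    return 0;
}

/* Two successive rewrites: original -> after rewrite 1 -> after
 * rewrite 2. Discarding m1 makes pre-rewrite-1 ctids unresolvable. */
static int
remap_through(const CtidRemap *m1, int n1,
              const CtidRemap *m2, int n2,
              uint32_t ctid, uint32_t *out)
{
    uint32_t mid;

    return remap_one(m1, n1, ctid, &mid) && remap_one(m2, n2, mid, out);
}
```

The second map alone cannot answer queries about pre-first-rewrite positions, which is why either the history must be kept or rewrites must wait for decoders to catch up, as the message says.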
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote: It seems to me that you have to think of the CTID map as tied to a relfilenode; if you try to use one relfilenode's map with a different relfilenode, it's obviously not going to work. So don't do that. It has to be tied to relfilenode (+ctid) *and* transaction unfortunately. I agree that it does, but it doesn't seem particularly unfortunate to me. That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. The problem is that the mapping would be somewhere *ahead* of the transaction/WAL we're currently decoding. We'd need to read ahead till we find the correct one. Yes, I think that's what you need to do. But I think I mainly misunderstood what you proposed. That mapping could be written beside the relfilenode, instead of into the WAL. Then my imagined problem doesn't exist anymore. That's pretty ugly. I think it should be written into WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-22 13:57:53 -0400, Robert Haas wrote: On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote: That strikes me as a flaw in the implementation rather than the idea. You're presupposing a patch where the necessary information is available in WAL yet you don't make use of it at the proper time. The problem is that the mapping would be somewhere *ahead* of the transaction/WAL we're currently decoding. We'd need to read ahead till we find the correct one. Yes, I think that's what you need to do. My problem with that is that the rewrite can be gigabytes into the future. When reading forward we could either just continue reading data into the reorderbuffer, but delay replaying all future commits till we found the currently needed remap. That might have quite the additional storage/memory cost, but runtime complexity should be the same as normal decoding. Or we could individually read ahead for every transaction. But doing so for every transaction will get rather expensive (roughly O(amount_of_wal^2)). I think that'd be pretty similar to just disallowing VACUUM FREEZE/CLUSTER on catalog relations since effectively it'd be too expensive to use. But I think I mainly misunderstood what you proposed. That mapping could be written beside the relfilenode, instead of into the WAL. Then my imagined problem doesn't exist anymore. That's pretty ugly. I think it should be written into WAL. It basically has O(1) access, that's why I was thinking about it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote: I know of the following solutions: 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical. 2) Make VACUUM FULL prevent DDL and then wait till all changestreams have decoded up to the current point. 3) Don't delete the old relfilenode for VACUUM/CLUSTERs of system tables if there are live decoding slots around, instead delegate that responsibility to the slot management. 4) Store both (cmin, cmax) for catalog tuples. I basically think only 1) and 4) are realistic. And 1) sucks. I've developed a prototype for 4) and except for currently being incredibly ugly, it seems to be the most promising approach by far. My trick to store both cmin and cmax is to store cmax in t_hoff managed space when wal_level = logical. In my opinion, (4) is too ugly to consider. I think that if we start playing games like this, we're opening up the doors to lots of subtle bugs and future architectural pain that will be with us for many, many years to come. I believe we will bitterly regret any foray into this area. It has long seemed to me to be a shame that we don't have some system for allowing old relfilenodes to stick around until they are no longer in use. If we had that, we might be able to allow utilities like CLUSTER or VACUUM FULL to permit concurrent read access to the table. I realize that what people really want is to let those things run while allowing concurrent *write* access to the table, but a bird in the hand is worth two in the bush. What we're really talking about here is applying MVCC to filesystem actions: instead of removing the old relfilenode(s) immediately, we do it when they're no longer referenced by anyone, just as we don't remove old tuples immediately, but rather when they are no longer referenced by anyone. 
The details are tricky, though: we can allow write access to the *new* heap just as soon as the rewrite is finished, but anyone who is still looking at the *old* heap can't ever upgrade their AccessShareLock to anything higher, or hilarity will ensue. Also, if they lock some *other* relation and AcceptInvalidationMessages(), their relcache entry for the rewritten relation will get rebuilt, and that's bound to work out poorly. The net-net here is that I think (3) is an attractive solution, but I don't know that we can make it work in a reasonable amount of time. I don't think I understand exactly what you have in mind for (2); can you elaborate? I have always thought that having a WaitForDecodingToCatchUp() primitive was a good way of handling changes that were otherwise too difficult to track our way through. I am not sure you're doing that at all right now, which in some sense I guess is fine, but I haven't really understood your aversion to this solution. There are some locking issues to be worked out here, but the problems don't seem altogether intractable. (1) is basically deciding not to fix the problem. I don't think that's acceptable. I don't have another idea right at the moment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-21 09:32:12 -0400, Robert Haas wrote: On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote: I know of the following solutions: 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical. 2) Make VACUUM FULL prevent DDL and then wait till all changestreams have decoded up to the current point. 3) Don't delete the old relfilenode for VACUUM/CLUSTERs of system tables if there are live decoding slots around, instead delegate that responsibility to the slot management. 4) Store both (cmin, cmax) for catalog tuples. I basically think only 1) and 4) are realistic. And 1) sucks. I've developed a prototype for 4) and except for currently being incredibly ugly, it seems to be the most promising approach by far. My trick to store both cmin and cmax is to store cmax in t_hoff managed space when wal_level = logical. In my opinion, (4) is too ugly to consider. I think that if we start playing games like this, we're opening up the doors to lots of subtle bugs and future architectural pain that will be with us for many, many years to come. I believe we will bitterly regret any foray into this area. Hm. After looking at the required code - which you obviously cannot have yet - it's not actually too bad. Will post a patch implementing it later. I don't really buy the architectural argument since originally cmin/cmax *were* both stored. It's not something we're just inventing now. We just optimized that away but now have discovered that's not always a good idea and thus don't always use the optimization. The actual decoding code shrinks by about 200 lines using this logic which is a hint that it's not a bad idea. It has long seemed to me to be a shame that we don't have some system for allowing old relfilenodes to stick around until they are no longer in use. If we had that, we might be able to allow utilities like CLUSTER or VACUUM FULL to permit concurrent read access to the table. 
I realize that what people really want is to let those things run while allowing concurrent *write* access to the table, but a bird in the hand is worth two in the bush. What we're really talking about here is applying MVCC to filesystem actions: instead of removing the old relfilenode(s) immediately, we do it when they're no longer referenced by anyone, just as we don't remove old tuples immediately, but rather when they are no longer referenced by anyone. The details are tricky, though: we can allow write access to the *new* heap just as soon as the rewrite is finished, but anyone who is still looking at the *old* heap can't ever upgrade their AccessShareLock to anything higher, or hilarity will ensue. Also, if they lock some *other* relation and AcceptInvalidationMessages(), their relcache entry for the rewritten relation will get rebuilt, and that's bound to work out poorly. The net-net here is that I think (3) is an attractive solution, but I don't know that we can make it work in a reasonable amount of time. I've looked at it before, and I honestly don't have a real clue how to do it robustly. I don't think I understand exactly what you have in mind for (2); can you elaborate? I have always thought that having a WaitForDecodingToCatchUp() primitive was a good way of handling changes that were otherwise too difficult to track our way through. I am not sure you're doing that at all right now, which in some sense I guess is fine, but I haven't really understood your aversion to this solution. There are some locking issues to be worked out here, but the problems don't seem altogether intractable. 
So, what we need to do for rewriting catalog tables would be:
1) lock table against writes
2) wait for all in-progress xacts to finish, they could have modified the table in question (we don't keep locks on system tables)
3) acquire xlog insert pointer
4) wait for all logical decoding actions to read past that pointer
5) upgrade the lock to an access exclusive one
6) perform vacuum full as usual
The lock upgrade hazards in here are the reason I am averse to the solution. And I don't see how we can avoid them, since in order for decoding to catch up it has to be able to read from the catalog... Otherwise it's easy enough to implement. (1) is basically deciding not to fix the problem. I don't think that's acceptable. I'd like to argue against this, but unfortunately I agree. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
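[Editor's note: step 4 of the list above is the WaitForDecodingToCatchUp()-style primitive mentioned earlier in the thread. A toy model of the check such a primitive would loop on; names are hypothetical, a plain uint64 stands in for XLogRecPtr, and a real implementation would sleep on a latch rather than poll.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a logical replication slot's progress. */
typedef struct LogicalSlot
{
    uint64_t confirmed_lsn;     /* how far this slot has decoded */
} LogicalSlot;

/* True once every slot has decoded past target_lsn (the WAL insert
 * position captured in step 3); until then the rewrite must wait. */
static int
all_slots_caught_up(const LogicalSlot *slots, int nslots,
                    uint64_t target_lsn)
{
    for (int i = 0; i < nslots; i++)
        if (slots[i].confirmed_lsn < target_lsn)
            return 0;           /* at least one decoder still behind */
    return 1;
}
```

The hazard the message describes sits exactly here: while the rewrite waits for this condition, decoders must still be able to read the catalog being rewritten, and the subsequent lock upgrade in step 5 can deadlock against other lockers.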
Re: [HACKERS] logical changeset generation v6.2
On Mon, Oct 21, 2013 at 1:52 PM, Andres Freund and...@2ndquadrant.com wrote: In my opinion, (4) is too ugly to consider. I think that if we start playing games like this, we're opening up the doors to lots of subtle bugs and future architectural pain that will be with us for many, many years to come. I believe we will bitterly regret any foray into this area. Hm. After looking at the required code - which you obviously cannot have yet - it's not actually too bad. Will post a patch implementing it later. I don't really buy the architectural argument since originally cmin/cmax *were* both stored. It's not something we're just inventing now. We just optimized that away but now have discovered that's not always a good idea and thus don't always use the optimization. The actual decoding code shrinks by about 200 lines using this logic which is a hint that it's not a bad idea. So, here's a preliminary patch to see how this would look. It'd be great if you'd comment on whether you still think it's a complete no-go. If it were for real, it'd need to be split and some minor things would need to get adjusted, but I think it's easier to review it seeing both sides at once. I think it's a complete no-go. Consider, e.g., the comment for MaxTupleAttributeNumber, which you've blithely falsified. Even if you update the comment and the value, I'm not inspired by the idea of subtracting 32 from that number; even if it weren't already too small, it would break pg_upgrade for any users who are on the edge. Things aren't looking too good for anything that uses HeapTupleFields, either; consider rewrite_heap_tuple(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-21 14:22:17 -0400, Robert Haas wrote: On Mon, Oct 21, 2013 at 1:52 PM, Andres Freund and...@2ndquadrant.com wrote: In my opinion, (4) is too ugly to consider. I think that if we start playing games like this, we're opening up the doors to lots of subtle bugs and future architectural pain that will be with us for many, many years to come. I believe we will bitterly regret any foray into this area. Hm. After looking at the required code - which you obviously cannot have yet - it's not actually too bad. Will post a patch implementing it later. I don't really buy the architectural argument since originally cmin/cmax *were* both stored. It's not something we're just inventing now. We just optimized that away but now have discovered that's not always a good idea and thus don't always use the optimization. The actual decoding code shrinks by about 200 lines using this logic which is a hint that it's not a bad idea. So, here's a preliminary patch to see how this would look. It'd be great if you'd comment on whether you still think it's a complete no-go. If it were for real, it'd need to be split and some minor things would need to get adjusted, but I think it's easier to review it seeing both sides at once. I think it's a complete no-go. Consider, e.g., the comment for MaxTupleAttributeNumber, which you've blithely falsified. Even if you update the comment and the value, I'm not inspired by the idea of subtracting 32 from that number; even if it weren't already too small, it would break pg_upgrade for any users who are on the edge. Well, we only need to support it for (user_)catalog tables. So pg_upgrade isn't a problem. And I don't really see a problem restricting the number of columns for such tables. Things aren't looking too good for anything that uses HeapTupleFields, either; consider rewrite_heap_tuple(). Well, that currently works, by copying cmax. Since rewriting triggered the change, I am pretty sure I've actually tested hitting that path... 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On Mon, Oct 21, 2013 at 2:27 PM, Andres Freund and...@2ndquadrant.com wrote: I think it's a complete no-go. Consider, e.g., the comment for MaxTupleAttributeNumber, which you've blithely falsified. Even if you update the comment and the value, I'm not inspired by the idea of subtracting 32 from that number; even if it weren't already too small, it would break pg_upgrade for any users who are on the edge. Well, we only need to support it for (user_)catalog tables. So pg_upgrade isn't a problem. And I don't really see a problem restricting the number of columns for such tables. Inch by inch, this patch set is trying to make catalog tables more and more different from regular tables. I think that's a direction we're going to regret. I can almost believe that no great harm will come to us from giving the two different xmin horizons, as previously discussed, though I'm still not 100% happy about that. Can't both have something be a catalog table AND replicate it? Ick, but OK. But changing the on-disk format strikes me as crossing some sort of bright line. That means that you're going to have two different code paths in a lot of important cases, one for catalog tables and one for non-catalog tables, and the one that's only taken for catalog tables will be rather lightly traveled. And then you've got user catalog tables, and the possibility that something that wasn't formerly a user catalog table might become one, or vice versa. Even if you can flush out every bug that exists today, this is a recipe for future bugs. Things aren't looking too good for anything that uses HeapTupleFields, either; consider rewrite_heap_tuple(). Well, that currently works, by copying cmax. Since rewriting triggered the change, I am pretty sure I've actually tested hitting that path... No offense, but that's a laughable statement. If that path works, it's mostly if not entirely by accident. 
You've fundamentally changed the heap tuple format, and that code doesn't know about it, even though it's deeply in bed with assumptions about the old format. I think this is a pretty clear indication as to what's wrong with this approach: a lot of stuff will not care, but the stuff that does care will be hard to find, and future incremental modifications either to that code or to the hidden data before the t_hoff pointer could break stuff that formerly worked. We rejected early drafts of sepgsql RLS cold because they changed the tuple format, and I don't see why we shouldn't do exactly the same thing here. But just suppose for a minute that we'd accepted that proposal and then took this one, too. And let's suppose we also accept the next proposal that, like that one and this one, jams something more into the heap tuple header. At that point you'll have potentially as many as 8 different maximum-number-of-attributes values for tuples, though maybe not quite that many in practice if not all of those features can be used together. The macros that are needed to extract the various values from the heap tuple will be nightmarishly complex, and we'll have eaten up all (or more than all) of our remaining bit-space in the infomask. Maybe all of that sounds all right to you, but to me it sounds like a big mess. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-21 14:50:54 -0400, Robert Haas wrote: On Mon, Oct 21, 2013 at 2:27 PM, Andres Freund and...@2ndquadrant.com wrote: I think it's a complete no-go. Consider, e.g., the comment for MaxTupleAttributeNumber, which you've blithely falsified. Even if you update the comment and the value, I'm not inspired by the idea of subtracting 32 from that number; even if it weren't already too small, it would break pg_upgrade for any users who are on the edge. Well, we only need to support it for (user_)catalog tables. So pg_upgrade isn't a problem. And I don't really see a problem restricting the number of columns for such tables. Inch by inch, this patch set is trying to make catalog tables more and more different from regular tables. I think that's a direction we're going to regret. I can understand that. Can't both have something be a catalog table AND replicate it? Ick, but OK. User catalog tables are replicated normally. If we want we can replicate system catalog tables as well, ok, it's about a day's worth of work. I just don't see what the use case would be and I am afraid it'd be used as a poor man's system table trigger. But changing the on-disk format strikes me as crossing some sort of bright line. That means that you're going to have two different code paths in a lot of important cases, one for catalog tables and one for non-catalog tables, and the one that's only taken for catalog tables will be rather lightly traveled. But there really isn't that much difference in the code paths, is there? If you look at it, minus the mechanical changes around bitmasks it's really mostly just a couple of added HeapTupleHeaderSetCmin/HeapTupleHeaderSetCmax calls + wal logging the current cid. And then you've got user catalog tables, and the possibility that something that wasn't formerly a user catalog table might become one, or vice versa. That's not really a problem - we only need cmin/cmax for catalog tables when decoding DML that happened in the same transaction as DDL. 
The wide cids could easily be removed when we freeze tuples or such. Even something like:
BEGIN;
CREATE TABLE catalog_t(...);
INSERT INTO catalog_t ...;
UPDATE catalog_t ...;
ALTER TABLE catalog_t SET (user_catalog_table = true);
INSERT INTO decoded_table (..);
INSERT INTO decoded_table (..);
UPDATE catalog_t SET ...;
INSERT INTO decoded_table (..);
COMMIT;
will work correctly. Even if you can flush out every bug that exists today, this is a recipe for future bugs. I don't really foresee many new codepaths that care about the visibility bits in tuple headers themselves. When has the last one of those been added/changed? I'd bet it was 9.0 with the vacuum full/cluster merge, and three lines to be added are surely the smallest problem of that. Things aren't looking too good for anything that uses HeapTupleFields, either; consider rewrite_heap_tuple(). Well, that currently works, by copying cmax. Since rewriting triggered the change, I am pretty sure I've actually tested hitting that path... No offense, but that's a laughable statement. If that path works, it's mostly if not entirely by accident. You've fundamentally changed the heap tuple format, and that code doesn't know about it, even though it's deeply in bed with assumptions about the old format. Hm? rewrite_heap_tuple() grew code to handle that case, it's not an accident that it works.
if (HeapTupleHeaderHasWideCid(old_tuple->t_data) &&
    HeapTupleHeaderHasWideCid(new_tuple->t_data))
{
    HeapTupleHeaderSetCmin(new_tuple->t_data,
                           HeapTupleHeaderGetCmin(old_tuple->t_data));
    HeapTupleHeaderSetCmax(new_tuple->t_data,
                           HeapTupleHeaderGetCmax(old_tuple->t_data),
                           false);
}
Note that code handling HeapTupleHeaders outside of heapam.c, tqual.c et al. shouldn't need to care about the changes at all. And all the code that needs to care *already* has special-cased code around visibility. It's relatively easy to find by grepping for HEAP_XACT_MASK. We've so far accepted that several places need to change if we change the visibility rules. 
And I don't see how we easily could get away from that. Patches like e.g. lsn-ranges freezing will require *gobs* more widespread changes. And have much higher chances of breaking stuff imo. And it's still worthwhile to do them. I think this is a pretty clear indication as to what's wrong with this approach: a lot of stuff will not care, but the stuff that does care will be hard to find, and future incremental modifications either to that code or to the hidden data before the t_hoff pointer could break stuff that formerly worked. We rejected early drafts of sepgsql RLS cold because they changed the tuple format, and I don't see why we shouldn't do exactly the same thing here. Ok, I have a hard time arguing against that. I either skipped or forgot that
Re: [HACKERS] logical changeset generation v6.2
Hi, On 2013-10-21 21:36:02 +0200, Andres Freund wrote: I think this approach would have lower maintenance overhead in comparison to the previous solution of handling CommandIds because it's actually much simpler. Btw, I think the new approach would allow for *easier* modifications by future code caring about visibility. Since the physical location doesn't matter anymore it'd be much more friendly towards things like an online compacting VACUUM and such. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-14 09:36:03 -0400, Robert Haas wrote: I thought and implemented that in the beginning. Unfortunately it's not enough :(. That's probably the issue that took me longest to understand in this patch series... Combocids can only fix the case where a transaction actually has created a combocid:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax = Invalid, cmin = 55, cmax = Invalid)
2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax = 2, cmin = Invalid, cmax = 1)

So, if we're decoding data that needs to look up those rows in TX1 or TX2 we both times need access to cmin and cmax, but neither transaction will have created a combocid. That can only be an issue in transactions with catalog modifications. Oh, yuck. So that means you have to write an extra WAL record for EVERY heap insert, update, or delete to a catalog table? OUCH. So. As it turns out that solution isn't sufficient in the face of VACUUM FULL and mixed DML/DDL transactions that have not yet been decoded. To reiterate, as published it works like: For every modification of a catalog tuple (insert, multi_insert, update, delete) that has influence over visibility, issue a record that contains:
* filenode
* ctid
* (cmin, cmax)
When doing a visibility check on a catalog row during decoding of a mixed DML/DDL transaction, look up (cmin, cmax) for that row since we don't store both for the tuple. That mostly works great. The problematic scenario is decoding a transaction that has done mixed DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM FULL obviously changes the filenode and the ctid of a tuple, so we cannot successfully do a lookup based on what we logged before. I know of the following solutions:
1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
2) Make VACUUM FULL prevent DDL and then wait until all change streams have decoded up to the current point. 
3) Don't delete the old relfilenode for VACUUM/CLUSTERs of system tables if there are live decoding slots around; instead delegate that responsibility to the slot management.
4) Store both (cmin, cmax) for catalog tuples.
I basically think only 1) and 4) are realistic. And 1) sucks. I've developed a prototype for 4) and, aside from currently being incredibly ugly, it seems to be the most promising approach by far. My trick to store both cmin and cmax is to store cmax in t_hoff managed space when wal_level = logical. That even works when changing wal_level to logical because we only ever need to store both cmin and cmax for transactions that have decodable content - which they cannot yet have before wal_level = logical. This requires some not so nice things:
* A way to declare we're storing both. I've currently chosen HEAP_MOVED_OFF | HEAP_MOVED_IN. That sucks.
* A way for heap_form_tuple to know it should add the necessary space to t_hoff. I've added TupleDesc->tdhaswidecid for it.
* Fiddling with existing checks for HEAP_MOVED{,_OFF,_IN} to check for both set at the same time.
* Changing the WAL logging to (optionally?) transport the current CommandId instead of always resetting it to InvalidCommandId.
The benefits are:
* Working VACUUM FULL
* Much simpler tqual.c logic; everything is stored in the row itself, no hash or something like that built.
* No more need to log (relfilenode, cmin, cmax) separately from the heap changes themselves.
In the end, the costs are that individual catalog rows are 4 bytes bigger iff wal_level = logical. That seems acceptable. Some questions remain:
* Better idea for a flag than HEAP_MOVED_OFF | HEAP_MOVED_IN
* Should we just unconditionally log the current CommandId or make it conditional. We have plenty of flag space to signal whether it's present, but it's just 4 bytes.
Comments? 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote: Well, I just think relying on specific symbol names in the .so file is kind of unfortunate. It means that, for example, you can't have multiple output plugins provided by a single .so. And in general I think it's something that we've tried to minimize. But that's not really different when you rely on _PG_init doing it's thing, right? Sure, that's true. But in general I think magic symbol names aren't a particularly good design. But there's only so much information available here. Why not just have a format that logs it all? Because we do not know what all is? Also, how would we handle replication sets and such that all of the existing replication solutions have generically? I don't see how you can fail to know what all is. There's only a certain set of facts available. I mean you could log irrelevant crap like a random number that you just picked or the sum of all numeric values in the column, but nobody's likely to want that. What people are going to want is the operation performed (insert, update, or delete), all the values in the new tuple, the key values from the old tuple, the transaction ID, and maybe some meta-information about the transaction (such as the commit timestamp). What I'd probably do is emit the data in CSV format, with the first column of each line being a single character indicating what sort of row this is: H means a header row, defining the format of subsequent rows (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk; a new header row is emitted only when the column list changes); I, U, or D means an insert, update, or delete, with column 2 being the transaction ID, column 3 being the table name, and the remaining columns matching the last header row for emitted for that table, T means meta-information about a transaction, whatever we have (e.g. T,txn_id,commit_time). 
There's probably some further tweaking of that that could be done, and I might be overlooking some salient details, like maybe we want to indicate the column types as well as their names, but the range of things that someone can want to do here is not unlimited. The point, for me anyway, is that someone can write a crappy Perl script to apply changes from a file like this in a day. My contention is that there are a lot of people who will want to do just that, for one reason or another. The plugin interface has awesome power and flexibility, and really high-performance replication solutions will really benefit from that. But regular people don't want to write C code; they just want to write a crappy Perl script. And I think we can facilitate that without too much work. Oh, yuck. So that means you have to write an extra WAL record for EVERY heap insert, update, or delete to a catalog table? OUCH. Yes. We could integrate it into the main record without too many problems, but it didn't seem like an important optimization and it would have higher chances of slowing down wal_level logical. Hmm. I don't know whether that's an important optimization or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
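To make the ease-of-parsing claim concrete: a throwaway consumer for the H/I/U/D/T format sketched above really is only a few dozen lines of scripting. This is just an illustration of the proposed format, not the output of any actual plugin; the table and values are invented:

```python
import csv
import io

def parse_changes(stream):
    """Parse the proposed ragged-CSV change stream.

    Row types, per the proposal above:
      H,<table>,<column>...          header: column list for subsequent rows
      I/U/D,<xid>,<table>,<value>... insert/update/delete
      T,<xid>,<commit_time>          meta-information about a transaction
    """
    headers = {}  # table name -> column list from the last H row seen
    for row in csv.reader(stream):
        kind = row[0]
        if kind == 'H':
            headers[row[1]] = row[2:]
        elif kind in ('I', 'U', 'D'):
            xid, table = int(row[1]), row[2]
            yield kind, xid, table, dict(zip(headers[table], row[3:]))
        elif kind == 'T':
            yield kind, int(row[1]), row[2], None

# Invented sample stream: one insert and one update in transaction 100.
sample = io.StringIO(
    "H,accounts,id,balance\n"
    "I,100,accounts,1,250\n"
    "U,100,accounts,1,300\n"
    "T,100,2013-10-15 09:00:00\n"
)
changes = list(parse_changes(sample))
```

Note that even this toy consumer has to carry the last header row per table across the whole stream - the state-tracking cost of emitting column lists only when they change.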
Re: [HACKERS] logical changeset generation v6.2
On Mon, Oct 14, 2013 at 5:07 PM, Andres Freund and...@2ndquadrant.com wrote: So, see the attached benchmark script. I've always done both a disk-bound and a memory-bound (using eatmydata, preventing fsyncs) run.
* unpatched run, wal_level = hot_standby, eatmydata
* unpatched run, wal_level = hot_standby
* patched run, wal_level = hot_standby, eatmydata
* patched run, wal_level = hot_standby
* patched run, wal_level = logical, eatmydata
* patched run, wal_level = logical
Based on those results, there's no difference above noise for wal_level=hot_standby, with or without the patch. With wal_level=logical there's a measurable increase in wal traffic (~12-17%), but no performance decrease above noise. From my POV that's ok, those are really crazy catalog workloads. Any increase in WAL traffic will translate into a performance hit once the I/O channel becomes saturated, but I agree those numbers don't sound terrible for that fairly brutal test case. Actually, I was more concerned about the hit on non-catalog workloads. pgbench isn't a good test because the key column is so narrow; but suppose we have a table like (a text, b integer, c text) where (a, c) is the primary key and those strings are typically pretty long - say just short enough that we can still index the column. It'd be worth testing both workloads where the primary key doesn't change (so the only overhead is figuring out that we need not log it) and those where it does (where we're double-logging most of the tuple). I assume the latter has to produce a significant hit to WAL volume, and I don't think there's much we can do about that; but the former had better be nearly free. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 08:42:20 -0400, Robert Haas wrote: On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote: Well, I just think relying on specific symbol names in the .so file is kind of unfortunate. It means that, for example, you can't have multiple output plugins provided by a single .so. And in general I think it's something that we've tried to minimize. But that's not really different when you rely on _PG_init doing it's thing, right? Sure, that's true. But in general I think magic symbol names aren't a particularly good design. It allows you to use the shared libary both as a normal extension loaded via shared_preload_library or adhoc and as an output plugin which seems like a sensible goal. We could have a single _PG_init_output_plugin() symbol that fills in such a struct which would then not conflict with using the .so independently. If you prefer that I'll change things around. We can't do something like 'output_plugin_in_progress' before calling _PG_init() because _PG_init() won't be called again if the shared object is already loaded... But there's only so much information available here. Why not just have a format that logs it all? Because we do not know what all is? Also, how would we handle replication sets and such that all of the existing replication solutions have generically? I don't see how you can fail to know what all is. There's only a certain set of facts available. I mean you could log irrelevant crap like a random number that you just picked or the sum of all numeric values in the column, but nobody's likely to want that. What people are going to want is the operation performed (insert, update, or delete), all the values in the new tuple, the key values from the old tuple, the transaction ID, and maybe some meta-information about the transaction (such as the commit timestamp). 
Some will want all column names included because that makes replication into different schemas/databases easier, others won't because it makes replicating the data more complicated and expensive. Lots will want the primary key as a separate set of columns even for inserts, others not. There are also the datatypes of values and the null representation. What I'd probably do is emit the data in CSV format, with the first column of each line being a single character indicating what sort of row this is: H means a header row, defining the format of subsequent rows (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk; a new header row is emitted only when the column list changes); I, U, or D means an insert, update, or delete, with column 2 being the transaction ID, column 3 being the table name, and the remaining columns matching the last header row emitted for that table; T means meta-information about a transaction, whatever we have (e.g. T,txn_id,commit_time). There are two issues I have with this: a) CSV seems like a bad format for this. If a transaction inserts into multiple tables the number of columns will constantly change. Many CSV parsers don't deal with that all too gracefully. E.g. you can't even load the data into another postgres database as an audit log. If we go for CSV I think we should put the entire primary key as one column (containing all the columns) and the entire row as another. We also don't have any nice facilities for actually writing CSV - so we'll need to start extracting escaping code from COPY. In the end all that will make the output plugin very hard to use as an example because the code will get more complicated. b) Emitting new row descriptors every time the schema changes will require keeping track of the schema. I think that won't be trivial. It also makes consumption of the data more complicated in comparison to including the description with every row. 
Both are even more true once we extend the format to support streaming of transactions while they are performed. But regular people don't want to write C code; they just want to write a crappy Perl script. And I think we can facilitate that without too much work. I think the generic output plugin should be a separate one from the example one (which is the one included in the patchset). Oh, yuck. So that means you have to write an extra WAL record for EVERY heap insert, update, or delete to a catalog table? OUCH. Yes. We could integrate it into the main record without too many problems, but it didn't seem like an important optimization and it would have higher chances of slowing down wal_level logical. Hmm. I don't know whether that's an important optimization or not. Based on the benchmark I'd say no. If we discover we need to go there we can do so later. I don't foresee this being really problematic. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 08:49:26 -0400, Robert Haas wrote: On Mon, Oct 14, 2013 at 5:07 PM, Andres Freund and...@2ndquadrant.com wrote: So, see the attached benchmark script. I've always done both a disk-bound and a memory-bound (using eatmydata, preventing fsyncs) run.
* unpatched run, wal_level = hot_standby, eatmydata
* unpatched run, wal_level = hot_standby
* patched run, wal_level = hot_standby, eatmydata
* patched run, wal_level = hot_standby
* patched run, wal_level = logical, eatmydata
* patched run, wal_level = logical
Based on those results, there's no difference above noise for wal_level=hot_standby, with or without the patch. With wal_level=logical there's a measurable increase in wal traffic (~12-17%), but no performance decrease above noise. From my POV that's ok, those are really crazy catalog workloads. Any increase in WAL traffic will translate into a performance hit once the I/O channel becomes saturated, but I agree those numbers don't sound terrible for that fairly brutal test case. Well, the parallel workloads were fsync-saturated, although probably not throughput-saturated; that's why I added them. But yes, it's not the same as a throughput-saturated IO channel. Probably the worst case real-world workload is one that uses lots and lots of ON COMMIT DROP temporary tables. Actually, I was more concerned about the hit on non-catalog workloads. pgbench isn't a good test because the key column is so narrow; but suppose we have a table like (a text, b integer, c text) where (a, c) is the primary key and those strings are typically pretty long - say just short enough that we can still index the column. It'd be worth testing both workloads where the primary key doesn't change (so the only overhead is figuring out that we need not log it) and those where it does (where we're double-logging most of the tuple). I assume the latter has to produce a significant hit to WAL volume, and I don't think there's much we can do about that; but the former had better be nearly free. 
Ah, ok. Then I misunderstood you. Is there a specific overhead you are afraid of in the pkey-doesn't-change scenario? The changed wal logging (buffer in a separate rdata entry) or the check whether the primary key has changed? The only way I have been able to measure differences in that scenario was to load a table with a low fillfactor and wide tuples, checkpoint, and then update lots of rows. On wal_level=logical that will result in full-page-images and tuple data being logged, which can be noticeable if you have really large tuples, even if the pkey doesn't change. We could optimize that by not actually logging the tuple data in that case but just including the tid, so we could extract things from the Bkp block ourselves. But that will complicate the code and doesn't yet seem warranted. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 15:17:58 +0200, Andres Freund wrote: If we go for CSV I think we should put the entire primary key as one column (containing all the columns) and the entire row as another. What about columns like:
* action B|I|U|D|C
* xid
* timestamp
* tablename
* key name
* key column names
* key column types
* new key column values
* column names
* column types
* column values
* candidate_key_changed?
* old key column values
And have output plugin options
* include-column-types
* include-column-names
* include-primary-key
If something isn't included it's simply left out. What still needs to be determined is:
* how do we separate and escape multiple values in one CSV column
* how do we represent NULLs
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund and...@2ndquadrant.com wrote: It allows you to use the shared libary both as a normal extension loaded via shared_preload_library or adhoc and as an output plugin which seems like a sensible goal. We could have a single _PG_init_output_plugin() symbol that fills in such a struct which would then not conflict with using the .so independently. If you prefer that I'll change things around. I think part of the problem may be that you're using the library name to identify the output plugin. I'm not excited about that design. For functions, you give the function a name and that is a pointer to where to actually find the function, which may be a 2-tuple library-name, function-name, or perhaps just a 1-tuple builtin-function-name, or maybe the whole text of a PL/pgsql procedure that should be compiled. Perhaps this ought to work similarly. Create a function in pg_proc which returns the structure containing the function pointers. Then, when that output plugin is selected, it'll automatically trigger loading the correct shared library if that's needed; and the shared library name may (but need not) match the output plugin name. What I'd probably do is emit the data in CSV format, with the first column of each line being a single character indicating what sort of row this is: H means a header row, defining the format of subsequent rows (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk; a new header row is emitted only when the column list changes); I, U, or D means an insert, update, or delete, with column 2 being the transaction ID, column 3 being the table name, and the remaining columns matching the last header row for emitted for that table, T means meta-information about a transaction, whatever we have (e.g. T,txn_id,commit_time). There's two issues I have with this: a) CSV seems like a bad format for this. If a transaction inserts into multiple tables the number of columns will constantly change. 
Many CSV parsers don't deal with that all too gracefully. E.g. you can't even load the data into another postgres database as an audit log. We can pick some other separator. I don't think ragged CSV is a big problem; I'm actually more worried about having an easy way to handle embedded commas and newlines and so on. But I'd be fine with tab-separated data or something too, if you think that's better. What I want is something that someone can parse with a script that can be written in a reasonable amount of time in their favorite scripting language. I predict that if we provide something like this we'll vastly expand the number of users who can make use of this new functionality. User: So, what's new in PostgreSQL 9.4? Hacker: Well, now we have logical replication! User: Why is that cool? Hacker: Well, streaming replication is awesome for HA, but it has significant limitations. And trigger-based systems are very mature, but the overhead is high and their lack of core integration makes them hard to use. With this technology, you can build systems that will replicate individual tables or even parts of tables, multi-master systems, and lots of other cool stuff. User: Wow, that sounds great. How do I use it? Hacker: Well, first you write an output plugin in C using a special API. User: Hey, do you know whether the MongoDB guys came to this conference? Let's try that again. User: Wow, that sounds great. How do I use it? Hacker: Well, currently, the output gets dumped as a series of text files that are designed to be parsed using a scripting language. We have sample parsers written in Perl and Python that you can use as-is or hack up to meet your needs. Now, some users are still going to head for the hills. But at least from where I sit it sounds a hell of a lot better than the first answer. We're not going to solve all of the tooling problems around this technology in one release, for sure. 
But as far as 95% of our users are concerned, a C API might as well not exist at all. People WILL try to machine parse the output of whatever demo plugins we provide; so I think we should try hard to provide at least one such plugin that is designed to make that as easy as possible. If we go for CSV I think we should put the entire primary key as one column (containing all the columns) and the entire row another. We also don't have any nice facilities for actually writing CSV - so we'll need to start extracting escaping code from COPY. In the end all that will make the output plugin very hard to use as an example because the code will get more complicated. b) Emitting new row descriptors everytime the schema changes will require keeping track of the schema. I think that won't be trivial. It also makes consumption of the data more complicated in comparison to including the description with every row. Both are even more true once we extend the format to support streaming of transactions while they are
Re: [HACKERS] logical changeset generation v6.2
On 10/15/2013 01:42 PM, Robert Haas wrote: On Mon, Oct 14, 2013 at 9:51 AM, Andres Freund and...@2ndquadrant.com wrote: Well, I just think relying on specific symbol names in the .so file is kind of unfortunate. It means that, for example, you can't have multiple output plugins provided by a single .so. And in general I think it's something that we've tried to minimize. But that's not really different when you rely on _PG_init doing it's thing, right? Sure, that's true. But in general I think magic symbol names aren't a particularly good design. But there's only so much information available here. Why not just have a format that logs it all? Because we do not know what all is? Also, how would we handle replication sets and such that all of the existing replication solutions have generically? I don't see how you can fail to know what all is. We instinctively know what all is - as in the famous case of buddhist ordering a hamburger - Make me All wit Everything :) - but the requirements of different replications systems vary wildly. ... What people are going to want is the operation performed (insert, update, or delete), all the values in the new tuple, the key values from the old tuple, For multi-master / conflict resolution you may also want all old values to make sure that they have not changed on target. the difference in WAL volume can be really significant, especially in the case of DELETE, where there are no new columns. for some forms of conflict resolution we may even want to know the database user who initiated the operation. and possibly even some session variables like very_important=yes. ... The point, for me anyway, is that someone can write a crappy Perl script to apply changes from a file like this in a day. My contention is that there are a lot of people who will want to do just that, for one reason or another. The plugin interface has awesome power and flexibility, and really high-performance replication solutions will really benefit from that. 
But regular people don't want to write C code; they just want to write a crappy Perl script. And I think we can facilitate that without too much work. just provide a to-csv or to-json plugin and the crappy perl guys are happy.
Re: [HACKERS] logical changeset generation v6.2
On 10/15/2013 02:47 PM, Andres Freund wrote: On 2013-10-15 15:17:58 +0200, Andres Freund wrote: If we go for CSV I think we should put the entire primary key as one column (containing all the columns) and the entire row as another. just use JSON :) What about columns like:
* action B|I|U|D|C
* xid
* timestamp
* tablename
* key name
* key column names
* key column types
* new key column values
* column names
* column types
* column values
* candidate_key_changed?
* old key column values
And have output plugin options
* include-column-types
* include-column-names
* include-primary-key
If something isn't included it's simply left out. What still needs to be determined is:
* how do we separate and escape multiple values in one CSV column
* how do we represent NULLs
or borrow whatever possible from pg_dump as they have needed to solve most of the same problems already, and consistency is good in general Greetings, Andres Freund
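For comparison, the column list above maps naturally onto one JSON object per change, which sidesteps both the ragged-column and the escaping/NULL questions. A sketch only; the field names simply mirror the list in this mail and are not any agreed-on format:

```python
import json

# One change, represented with the columns proposed above; all values invented.
change = {
    "action": "U",
    "xid": 100,
    "timestamp": "2013-10-15 09:00:00",
    "tablename": "public.accounts",
    "key": {"names": ["id"], "types": ["int4"], "values": [1]},
    "new": {"names": ["id", "balance"], "types": ["int4", "int4"],
            "values": [1, 300]},
    "candidate_key_changed": False,
    "old_key_values": None,  # present only when the candidate key changed
}
line = json.dumps(change)   # one change per line of output
decoded = json.loads(line)  # NULLs and nested key/row structure come for free
```

The include-column-types / include-column-names plugin options would then just control which of these keys get emitted.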
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 9:47 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-15 15:17:58 +0200, Andres Freund wrote: If we go for CSV I think we should put the entire primary key as one column (containing all the columns) and the entire row as another. What about columns like:
* action B|I|U|D|C
BEGIN and COMMIT?
* xid
* timestamp
* tablename
* key name
* key column names
* key column types
* new key column values
* column names
* column types
* column values
* candidate_key_changed?
* old key column values
Repeating the column names for every row strikes me as a nonstarter. If the plugin interface isn't rich enough to provide a convenient way to avoid that, then it needs to be fixed so that it is, because it will be a common requirement. Sure, some people may want JSON or XML output that reiterates the labels every time, but for a lot of people that's going to greatly increase the size of the output and be undesirable for that reason. What still needs to be determined is:
* how do we separate and escape multiple values in one CSV column
* how do we represent NULLs
I consider the escaping a key design decision. Ideally, it should be something that's easy to reverse from a scripting language; ideally also, it should be something similar to how we handle COPY. These goals may be in conflict; we'll have to pick something. I'm not sure that having multiple values in one column is a good plan, because now you need multiple levels of parsing to unpack the row. I'd rather just have a flat column list with a key somewhere explaining how to interpret the data. But I'm prepared to give in on that point so long as we can demonstrate that the format can be easily parsed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
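For reference, COPY's text format answers both open questions with backslash escapes and \N for NULL, and that scheme is trivially reversible from a scripting language. A rough sketch in that style (not the actual COPY code; the tab delimiter is just the illustrative choice):

```python
def field_escape(value):
    """COPY-text-style escaping: backslash-escape the delimiter,
    newlines and backslashes; spell NULL as \\N (sketch only)."""
    if value is None:
        return '\\N'
    return (value.replace('\\', '\\\\')
                 .replace('\t', '\\t')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r'))

def field_unescape(field):
    """Reverse of field_escape."""
    if field == '\\N':
        return None
    escapes = {'\\': '\\', 't': '\t', 'n': '\n', 'r': '\r'}
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\' and i + 1 < len(field):
            out.append(escapes.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)

# Round trip: delimiter, NULL, newline and backslash all survive.
row = ['a\tb', None, 'line1\nline2', 'back\\slash']
encoded = '\t'.join(field_escape(v) for v in row)
decoded = [field_unescape(f) for f in encoded.split('\t')]
```

A literal two-character string backslash-N escapes to a doubled backslash plus N, so it stays distinguishable from the NULL marker.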
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 10:09 AM, Hannu Krosing ha...@krosing.net wrote: I don't see how you can fail to know what all is. We instinctively know what all is - as in the famous case of buddhist ordering a hamburger - Make me All wit Everything :) - but the requirements of different replication systems vary wildly. That's true to some degree, but let's not exaggerate the degree to which it is true. For multi-master / conflict resolution you may also want all old values to make sure that they have not changed on target. The patch as proposed doesn't make that information available. If you want that to be an option, now would be the right time to argue for it. for some forms of conflict resolution we may even want to know the database user who initiated the operation. and possibly even some session variables like very_important=yes. Well, if you have requirements like logging very_important=yes, then you're definitely into the territory where you need your own output plugin. I have no problem telling people who want that sort of thing that they've got to go write C code. What I'm trying to do, as Larry Wall once said, is to make simple things simple and hard things possible. The output plugin interface accomplishes the latter, but, by itself, not the former. And I think we can facilitate that without too much work. just provide a to-csv or to-json plugin and the crappy perl guys are happy. Yep, that's exactly what I'm advocating for. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 10:20:55 -0400, Robert Haas wrote: For multi-master / conflict resolution you may also want all old values to make sure that they have not changed on target. The patch as proposed doesn't make that information available. If you want that to be an option, now would be the right time to argue for it. I don't think you necessarily want it for most MM solutions, but I agree it will be useful for some scenarios. I think the ReorderBufferChange struct needs a better way to distinguish between old-key and old-tuple now, but I'd rather implement the facility for logging the full old tuple in a separate patch. The patchset is big enough as is, let's not tack on more features. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 10:09:05 -0400, Robert Haas wrote: On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund and...@2ndquadrant.com wrote: It allows you to use the shared library both as a normal extension loaded via shared_preload_libraries or ad hoc and as an output plugin, which seems like a sensible goal. We could have a single _PG_init_output_plugin() symbol that fills in such a struct which would then not conflict with using the .so independently. If you prefer that I'll change things around. I think part of the problem may be that you're using the library name to identify the output plugin. I'm not excited about that design. For functions, you give the function a name and that is a pointer to where to actually find the function, which may be a 2-tuple library-name, function-name, or perhaps just a 1-tuple builtin-function-name, or maybe the whole text of a PL/pgsql procedure that should be compiled. That means you allow trivial remote code execution since you could try to load system() or something else that's available in every shared object. Now you can argue that that's OK since we have special checks for replication connections, but I'd rather not go there. Perhaps this ought to work similarly. Create a function in pg_proc which returns the structure containing the function pointers. Then, when that output plugin is selected, it'll automatically trigger loading the correct shared library if that's needed; and the shared library name may (but need not) match the output plugin name. I'd like to avoid relying on inserting stuff into pg_proc because that makes it harder to extract WAL from a HS standby. Requiring that to be configured on the primary to extract data on the standby seems confusing to me. But perhaps that's the correct solution :/ Now, some users are still going to head for the hills. But at least from where I sit it sounds a hell of a lot better than the first answer. 
We're not going to solve all of the tooling problems around this technology in one release, for sure. But as far as 95% of our users are concerned, a C API might as well not exist at all. People WILL try to machine parse the output of whatever demo plugins we provide; so I think we should try hard to provide at least one such plugin that is designed to make that as easy as possible. Well, just providing the C API + an example in a first step didn't work out too badly for FDWs. I am pretty sure that once released there will soon be extensions for it on PGXN or whatever for special usecases. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 10:27 AM, Andres Freund and...@2ndquadrant.com wrote: I think part of the problem may be that you're using the library name to identify the output plugin. I'm not excited about that design. For functions, you give the function a name and that is a pointer to where to actually find the function, which may be a 2-tuple library-name, function-name, or perhaps just a 1-tuple builtin-function-name, or maybe the whole text of a PL/pgsql procedure that should be compiled. That means you allow trivial remote code execution since you could try to load system() or something else that's available in every shared object. Now you can argue that that's OK since we have special checks for replication connections, but I'd rather not go there. Well, obviously you can't let somebody load any library they want. But that's pretty much true anyway; LOAD had better be confined to superusers unless there is something (like a pg_proc entry) that provides prior authorization for that specific load. Perhaps this ought to work similarly. Create a function in pg_proc which returns the structure containing the function pointers. Then, when that output plugin is selected, it'll automatically trigger loading the correct shared library if that's needed; and the shared library name may (but need not) match the output plugin name. I'd like to avoid relying on inserting stuff into pg_proc because that makes it harder to extract WAL from a HS standby. Requiring to configure that on the primary to extract data on the standby seems confusing to me. But perhaps that's the correct solution :/ That's a reasonable concern. I don't have another idea at the moment, unless we want to allow replication connections to issue LOAD commands. Then you can LOAD the library, so that the plug-in is registered under the well-known name you expect it to have, and then use that name to start replication. Now, some users are still going to head for the hills. 
But at least from where I sit it sounds a hell of a lot better than the first answer. We're not going to solve all of the tooling problems around this technology in one release, for sure. But as far as 95% of our users are concerned, a C API might as well not exist at all. People WILL try to machine parse the output of whatever demo plugins we provide; so I think we should try hard to provide at least one such plugin that is designed to make that as easy as possible. Well, just providing the C API + an example in a first step didn't work out too badly for FDWs. I am pretty sure that once released there will soon be extensions for it on PGXN or whatever for special usecases. I suspect so, too. But I also think that if that's the only thing available in the first release, a lot of users will get a poor initial impression. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 10:15:14 -0400, Robert Haas wrote: On Tue, Oct 15, 2013 at 9:47 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-15 15:17:58 +0200, Andres Freund wrote: If we go for CSV I think we should put the entire primary key as one column (containing all the columns) and the entire row in another. What about columns like: * action B|I|U|D|C BEGIN and COMMIT? That's B and C, yes. You'd rather not have them? When would you replay the commit without an explicit message telling you to? Repeating the column names for every row strikes me as a nonstarter. [...] Sure, some people may want JSON or XML output that reiterates the labels every time, but for a lot of people that's going to greatly increase the size of the output and be undesirable for that reason. But I argue that most simpler users - which are exactly the ones a generic output plugin is aimed at - will want all column names since it makes replay far easier. If the plugin interface isn't rich enough to provide a convenient way to avoid that, then it needs to be fixed so that it is, because it will be a common requirement. Oh, it surely is possible to avoid repeating it. The output plugin interface simply gives you a relcache entry that contains everything necessary. The output plugin would need to keep track of whether it has output data for a specific relation and it would need to check whether the table definition has changed, but I don't see how we could avoid that? What still needs to be determined is:
* how do we separate and escape multiple values in one CSV column
* how do we represent NULLs
I consider the escaping a key design decision. Ideally, it should be something that's easy to reverse from a scripting language; ideally also, it should be something similar to how we handle COPY. These goals may be in conflict; we'll have to pick something. Note that parsing COPYs is a major PITA from most languages... Perhaps we should make the default output json instead? 
With every action terminated by a nullbyte? That's probably easier to parse from various scripting languages than anything else. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
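[Editor's note: the null-byte-terminated JSON stream suggested above is straightforward to consume from a scripting language. A sketch with invented field names — the actual plugin output format is still undecided at this point in the thread:]

```python
import json

# Hypothetical stream: one JSON document per action, each terminated by a
# null byte as floated upthread. Field names are illustrative only.
stream = (
    b'{"action": "B", "xid": 1234}\x00'
    b'{"action": "I", "table": "public.t", "new": {"id": 1}}\x00'
    b'{"action": "C", "xid": 1234}\x00'
)

# Splitting on the terminator yields one parseable document per action;
# no escaping rules beyond JSON's own are needed, which is the appeal.
actions = [json.loads(chunk) for chunk in stream.split(b'\x00') if chunk]
print([a["action"] for a in actions])
```

A real consumer would read incrementally and split on the terminator as bytes arrive, but the parsing step is the same.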
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 10:34:53 -0400, Robert Haas wrote: On Tue, Oct 15, 2013 at 10:27 AM, Andres Freund and...@2ndquadrant.com wrote: I think part of the problem may be that you're using the library name to identify the output plugin. I'm not excited about that design. For functions, you give the function a name and that is a pointer to where to actually find the function, which may be a 2-tuple library-name, function-name, or perhaps just a 1-tuple builtin-function-name, or maybe the whole text of a PL/pgsql procedure that should be compiled. That means you allow trivial remote code execution since you could try to load system() or something else that's available in every shared object. Now you can argue that that's OK since we have special checks for replication connections, but I'd rather not go there. Well, obviously you can't let somebody load any library they want. But that's pretty much true anyway; LOAD had better be confined to superusers unless there is something (like a pg_proc entry) that provides prior authorization for that specific load. Currently you can create users that have permissions for replication but which are not superusers. I think we should strive to provide that capability for changeset extraction as well. Perhaps this ought to work similarly. Create a function in pg_proc which returns the structure containing the function pointers. Then, when that output plugin is selected, it'll automatically trigger loading the correct shared library if that's needed; and the shared library name may (but need not) match the output plugin name. I'd like to avoid relying on inserting stuff into pg_proc because that makes it harder to extract WAL from a HS standby. Requiring to configure that on the primary to extract data on the standby seems confusing to me. But perhaps that's the correct solution :/ That's a reasonable concern. I don't have another idea at the moment, unless we want to allow replication connections to issue LOAD commands. 
Then you can LOAD the library, so that the plug-in is registered under the well-known name you expect it to have, and then use that name to start replication. But what's the advantage of that over the current situation or one where PG_load_output_plugin() is called? The current and related implementations allow you to load libraries only from some designated postgres directories, and they don't allow you to call arbitrary functions in there. Would you be content with a symbol PG_load_output_plugin being called that fills out the actual callbacks? Now, some users are still going to head for the hills. But at least from where I sit it sounds a hell of a lot better than the first answer. We're not going to solve all of the tooling problems around this technology in one release, for sure. But as far as 95% of our users are concerned, a C API might as well not exist at all. People WILL try to machine parse the output of whatever demo plugins we provide; so I think we should try hard to provide at least one such plugin that is designed to make that as easy as possible. Well, just providing the C API + an example in a first step didn't work out too badly for FDWs. I am pretty sure that once released there will soon be extensions for it on PGXN or whatever for special usecases. I suspect so, too. But I also think that if that's the only thing available in the first release, a lot of users will get a poor initial impression. I think lots of people will expect a builtin logical replication solution :/. Which seems a tad unlikely to arrive in 9.4. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 10:56 AM, Andres Freund and...@2ndquadrant.com wrote: That means you allow trivial remote code execution since you could try to load system() or something else that's available in every shared object. Now you can argue that that's OK since we have special checks for replication connections, but I'd rather not go there. Well, obviously you can't let somebody load any library they want. But that's pretty much true anyway; LOAD had better be confined to superusers unless there is something (like a pg_proc entry) that provides prior authorization for that specific load. Currently you can create users that have permissions for replication but which are not superusers. I think we should strive to providing that capability for changeset extraction as well. I agree. Perhaps this ought to work similarly. Create a function in pg_proc which returns the structure containing the function pointers. Then, when that output plugin is selected, it'll automatically trigger loading the correct shared library if that's needed; and the shared library name may (but need not) match the output plugin name. I'd like to avoid relying on inserting stuff into pg_proc because that makes it harder to extract WAL from a HS standby. Requiring to configure that on the primary to extract data on the standby seems confusing to me. But perhaps that's the correct solution :/ That's a reasonable concern. I don't have another idea at the moment, unless we want to allow replication connections to issue LOAD commands. Then you can LOAD the library, so that the plug-in is registered under the well-known name you expect it to have, and then use that name to start replication. But what's the advantage of that over the current situation or one where PG_load_output_plugin() is called? The current and related implementations allow you to only load libraries in some designated postgres directories and it doesn't allow you to call any arbitrary functions in there. 
Well, I've already said why I don't like conflating the library name and the plugin name. It rules out core plugins and libraries that provide multiple plugins. I don't have anything to add to that. Would you be content with a symbol PG_load_output_plugin being called that fills out the actual callbacks? Well, it doesn't fix the muddling of library names with output plugin names, but I suppose I'd find it a modest improvement. Well, just providing the C API + an example in a first step didn't work out too badly for FDWs. I am pretty sure that once released there will soon be extensions for it on PGXN or whatever for special usecases. I suspect so, too. But I also think that if that's the only thing available in the first release, a lot of users will get a poor initial impression. I think lots of people will expect a builtin logical replication solution :/. Which seems a tad unlikely to arrive in 9.4. Yep. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 11:02:39AM -0400, Robert Haas wrote: goals may be in conflict; we'll have to pick something. Note that parsing COPYs is a major PITA from most languages... Perhaps we should make the default output json instead? With every action terminated by a nullbyte? That's probably easier to parse from various scripting languages than anything else. I could go for that. It's not quite as compact as I might hope, but JSON does seem to make people awfully happy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Feeding such a JSON stream into a compression algorithm like lz4 or snappy should result in a pretty compact stream. The latest lz4 updates also have ability to use a pre-existing dictionary which would really help remove the redundant pieces. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-15 11:02:39 -0400, Robert Haas wrote: If the plugin interface isn't rich enough to provide a convenient way to avoid that, then it needs to be fixed so that it is, because it will be a common requirement. Oh, it surely is possible to avoid repeating it. The output plugin interface simply gives you a relcache entry that contains everything necessary. The output plugin would need to keep track of whether it has output data for a specific relation and it would need to check whether the table definition has changed, but I don't see how we could avoid that? Well, it might be nice if there were a callback for, hey, schema has changed! Seems like a lot of plugins will want to know that for one reason or another, and rechecking for every tuple sounds expensive. I don't really see how we could provide that in any useful manner. We could provide a callback that is called whenever another transaction has changed the schema, but there's nothing that can easily be done about schema changes by the replayed transaction itself. And those are the only ones where meaningful schema changes can happen since the locks the source transaction has held will prevent most other schema changes. As much as I hate such code, I guess checking (and possibly storing) the ctid||xmin of the pg_class row is the easiest thing we could do :(. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
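[Editor's note: the ctid||xmin idea amounts to a per-relation cache keyed on the version of the relation's pg_class row. Transliterated to Python for illustration — the lookup values are invented, and in a real output plugin this would read the relcache entry in C:]

```python
# Sketch of the caching pattern described above: the plugin remembers, per
# relation, the (ctid, xmin) of its pg_class row and re-emits column
# metadata only when that pair changes, i.e. when the row was rewritten.
class SchemaCache:
    def __init__(self):
        self._seen = {}  # relation oid -> (pg_class ctid, pg_class xmin)

    def needs_metadata(self, reloid, pg_class_ctid, pg_class_xmin):
        """True when column metadata must be (re)emitted for this relation."""
        key = (pg_class_ctid, pg_class_xmin)
        if self._seen.get(reloid) == key:
            return False  # same pg_class row version: schema unchanged
        self._seen[reloid] = key
        return True

cache = SchemaCache()
print(cache.needs_metadata(16384, "(0,1)", 900))  # first sight of relation
print(cache.needs_metadata(16384, "(0,1)", 900))  # unchanged: no re-emit
print(cache.needs_metadata(16384, "(0,7)", 955))  # row rewritten by DDL
```

This keeps the per-tuple check to one dictionary lookup rather than a full comparison of the table definition.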
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 10:48 AM, Andres Freund and...@2ndquadrant.com wrote: What about columns like: * action B|I|U|D|C BEGIN and COMMIT? That's B and C, yes. You'd rather not have them? When would you replay the commit without an explicit message telling you to? No, BEGIN and COMMIT sounds good, actually. Just wanted to make sure I understood. Repeating the column names for every row strikes me as a nonstarter. [...] Sure, some people may want JSON or XML output that reiterates the labels every time, but for a lot of people that's going to greatly increase the size of the output and be undesirable for that reason. But I argue that most simpler users - which are exactly the ones a generic output plugin is aimed at - will want all column names since it makes replay far easier. Meh, maybe. If the plugin interface isn't rich enough to provide a convenient way to avoid that, then it needs to be fixed so that it is, because it will be a common requirement. Oh, it surely is possible to avoid repeating it. The output plugin interface simply gives you a relcache entry, that contains everything necessary. The output plugin would need to keep track of whether it has output data for a specific relation and it would need to check whether the table definition has changed, but I don't see how we could avoid that? Well, it might be nice if there were a callback for, hey, schema has changed! Seems like a lot of plugins will want to know that for one reason or another, and rechecking for every tuple sounds expensive. What still needs to be determined is: * how do we separate and escape multiple values in one CSV column * how do we represent NULLs I consider the escaping a key design decision. Ideally, it should be something that's easy to reverse from a scripting language; ideally also, it should be something similar to how we handle COPY. These goals may be in conflict; we'll have to pick something. Note that parsing COPYs is a major PITA from most languages... 
Perhaps we should make the default output json instead? With every action terminated by a nullbyte? That's probably easier to parse from various scripting languages than anything else. I could go for that. It's not quite as compact as I might hope, but JSON does seem to make people awfully happy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On 10/15/2013 07:56 AM, Andres Freund wrote: Well, just providing the C API + an example in a first step didn't work out too badly for FDWs. I am pretty sure that once released there will soon be extensions for it on PGXN or whatever for special usecases. I suspect so, too. But I also think that if that's the only thing available in the first release, a lot of users will get a poor initial impression. I think lots of people will expect a builtin logical replication solution :/. Which seems a tad unlikely to arrive in 9.4. Well, last I checked the Slony team is hard at work on building something which will be based on logical changesets. So there will likely be at least one tool available shortly after 9.4 is released. A good and flexible API is, IMHO, more important than having any finished solution. The whole reason why logical replication was outside core PG for so long is that replication systems have differing and mutually incompatible goals. A good API can support all of those goals; a user-level tool, no matter how good, can't. And, frankly, once the API is built, how hard will it be to write a script which does the simplest replication approach (replay all statements on slave)? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 10:09:05AM -0400, Robert Haas wrote: On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund and...@2ndquadrant.com wrote: User: So, what's new in PostgreSQL 9.4? Hacker: Well, now we have logical replication! User: Why is that cool? Hacker: Well, streaming replication is awesome for HA, but it has significant limitations. And trigger-based systems are very mature, but the overhead is high and their lack of core integration makes them hard to use. With this technology, you can build systems that will replicate individual tables or even parts of tables, multi-master systems, and lots of other cool stuff. User: Wow, that sounds great. How do I use it? Hacker: Well, first you write an output plugin in C using a special API. User: Hey, do you know whether the MongoDB guys came to this conference? Let's try that again. User: Wow, that sounds great. How do I use it? Hacker: Well, currently, the output gets dumped as a series of text files that are designed to be parsed using a scripting language. We have sample parsers written in Perl and Python that you can use as-is or hack up to meet your needs. My version: Hacker: the output gets dumped as a series of JSON files. We have docs for this rev of the format and examples of consumers in Perl and Python you can use as-is or hack up to meet your needs. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Tue, Oct 15, 2013 at 7:09 AM, Robert Haas robertmh...@gmail.com wrote: Let's try that again. User: Wow, that sounds great. How do I use it? Hacker: Well, currently, the output gets dumped as a series of text files that are designed to be parsed using a scripting language. We have sample parsers written in Perl and Python that you can use as-is or hack up to meet your needs. Have you heard of multicorn? Plugin authors can write a wrapper that spits out JSON or whatever other thing they like, which can be consumed by non C-hackers. Now, some users are still going to head for the hills. But at least from where I sit it sounds a hell of a lot better than the first answer. We're not going to solve all of the tooling problems around this technology in one release, for sure. But as far as 95% of our users are concerned, a C API might as well not exist at all. People WILL try to machine parse the output of whatever demo plugins we provide; so I think we should try hard to provide at least one such plugin that is designed to make that as easy as possible. I agree that this is important, but I wouldn't like to weigh it too heavily. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On Fri, Oct 11, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote: I don't see any need for SQL syntax. I was just thinking that the _PG_init function could fill in a structure and then call RegisterLogicalReplicationOutputPlugin(mystruct). Hm. We can do that, but what'd be the advantage of that? The current model will correctly handle things like a 'shared_preload_libraries'ed output plugin, because its _PG_init() will not register it. With the handling done in _PG_init() there would be two. Being able to use the same .so for output plugin handling and some other replication solution specific stuff is imo useful. Well, I just think relying on specific symbol names in the .so file is kind of unfortunate. It means that, for example, you can't have multiple output plugins provided by a single .so. And in general I think it's something that we've tried to minimize. I don't see why you're so pessimistic about that. I know you haven't worked it out yet, but what makes this harder than sitting down and designing something? Because every replication solution has different requirements for the format and they will want to filter the output stream with regard to their own configuration. E.g. bucardo will want to include the transaction timestamp for conflict resolution and such. But there's only so much information available here. Why not just have a format that logs it all? Sure, that's no problem. Do I understand correctly that you'd like
  wal_decoding: Add information about a tables primary key to struct RelationData
  wal_decoding: Add wal_level = logical and log data required for logical decoding
earlier? Yes. I'd really like to do so. I am travelling atm, but I will be back tomorrow evening and will push an updated patch this weekend. 
The issue I know of in the latest patches at http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de is the renaming issue from http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de I'm a bit nervous about the way the combo CID logging works. I would have thought that you would emit one record per combo CID, but what you're apparently doing is emitting one record per heap tuple that uses a combo CID. I thought and implemented that in the beginning. Unfortunately it's not enough :(. That's probably the issue that took me longest to understand in this patch series... Combocids can only fix the case where a transaction actually has created a combocid:
1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax = Invalid, cmin = 55, cmax = Invalid)
2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax = 2, cmin = Invalid, cmax = 1)
So, if we're decoding data that needs to look up those rows in TX1 or TX2 we need access to cmin and cmax both times, but neither transaction will have created a combocid. That can only be an issue in transactions with catalog modifications. Oh, yuck. So that means you have to write an extra WAL record for EVERY heap insert, update, or delete to a catalog table? OUCH. Couldn't measure anything either, which is not surprising given that I couldn't measure the overhead in the first place. I've done some parallel INSERT/DELETE pgbenching around the wal_level=logical and I couldn't measure any overhead with it disabled. With wal_level = logical, UPDATEs and DELETEs do get a bit slower, but that's to be expected. It'd probably not hurt to redo those benchmarks to make sure... Yes, I think it would be good to characterize it more precisely than a bit, so people know what to expect. 
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-14 09:36:03 -0400, Robert Haas wrote: On Fri, Oct 11, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote: I don't see any need for SQL syntax. I was just thinking that the _PG_init function could fill in a structure and then call RegisterLogicalReplicationOutputPlugin(mystruct). Hm. We can do that, but what'd be the advantage of that? The current model will correctly handle things like a 'shared_preload_libraries'ed output plugin, because its _PG_init() will not register it. With the handling done in _PG_init() there would be two. Being able to use the same .so for output plugin handling and some other replication solution specific stuff is imo useful. Well, I just think relying on specific symbol names in the .so file is kind of unfortunate. It means that, for example, you can't have multiple output plugins provided by a single .so. And in general I think it's something that we've tried to minimize. But that's not really different when you rely on _PG_init doing its thing, right? I don't see why you're so pessimistic about that. I know you haven't worked it out yet, but what makes this harder than sitting down and designing something? Because every replication solution has different requirements for the format and they will want to filter the output stream with regard to their own configuration. E.g. bucardo will want to include the transaction timestamp for conflict resolution and such. But there's only so much information available here. Why not just have a format that logs it all? Because we do not know what all is? Also, how would we handle replication sets and such that all of the existing replication solutions have generically? Sure, that's no problem. Do I understand correctly that you'd like wal_decoding: Add information about a tables primary key to struct RelationData wal_decoding: Add wal_level = logical and log data required for logical decoding earlier? Yes. That's done. Hope the new order makes sense. 
So, if we're decoding data that needs to look up those rows in TX1 or TX2 we both times need access to cmin and cmax, but neither transaction will have created a combocid. That can only be an issue in transactions with catalog modifications. Oh, yuck. So that means you have to write an extra WAL record for EVERY heap insert, update, or delete to a catalog table? OUCH. Yes. We could integrate it into the main record without too many problems, but it didn't seem like an important optimization and it would have higher chances of slowing down wal_level = logical. It'd probably not hurt to redo those benchmarks to make sure... Yes, I think it would be good to characterize it more precisely than a bit, so people know what to expect. A bit was below the 3% range for loops of adding columns. So, any tests you'd like to see? * loop around CREATE TABLE/DROP TABLE * loop around ALTER TABLE ... ADD COLUMN * loop around CREATE FUNCTION/DROP FUNCTION Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-14 15:51:14 +0200, Andres Freund wrote: It'd probably not hurt to redo those benchmarks to make sure... Yes, I think it would be good to characterize it more precisely than a bit, so people know what to expect. A bit was below the 3% range for loops of adding columns. So, any tests you'd like to see? * loop around CREATE TABLE/DROP TABLE * loop around ALTER TABLE ... ADD COLUMN * loop around CREATE FUNCTION/DROP FUNCTION So, see the attached benchmark script. I've always done both a disk-bound and a memory-bound (using eatmydata, preventing fsyncs) run. * unpatched run, wal_level = hot_standby, eatmydata * unpatched run, wal_level = hot_standby * patched run, wal_level = hot_standby, eatmydata * patched run, wal_level = hot_standby * patched run, wal_level = logical, eatmydata * patched run, wal_level = logical Based on those results, there's no difference above noise for wal_level=hot_standby, with or without the patch. With wal_level=logical there's a measurable increase in WAL traffic (~12-17%), but no performance decrease above noise. From my POV that's ok, those are really crazy catalog workloads.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

$ eatmydata /tmp/bench.sh
createtable single
tps = 320.293146 (including connections establishing)
tps = 320.334981 (excluding connections establishing)
262152  /tmp/perf/pg_xlog
createtable parallel
tps = 1516.095491 (including connections establishing)
tps = 1516.488664 (excluding connections establishing)
1163272 /tmp/perf/pg_xlog
altertable single
tps = 139.474141 (including connections establishing)
tps = 139.491721 (excluding connections establishing)
163848  /tmp/perf/pg_xlog
altertable parallel
tps = 774.868029 (including connections establishing)
tps = 775.136842 (excluding connections establishing)
851976  /tmp/perf/pg_xlog
createfunction single
tps = 2938.380801 (including connections establishing)
tps = 2938.711692 (excluding connections establishing)
81928   /tmp/perf/pg_xlog
createfunction parallel
tps = 20603.023567 (including connections establishing)
tps = 20608.799987 (excluding connections establishing)
458760  /tmp/perf/pg_xlog

$ /tmp/bench.sh
createtable single
tps = 51.014096 (including connections establishing)
tps = 51.020329 (excluding connections establishing)
49160   /tmp/perf/pg_xlog
createtable parallel
tps = 171.012045 (including connections establishing)
tps = 171.054406 (excluding connections establishing)
147464  /tmp/perf/pg_xlog
altertable single
tps = 9.138758 (including connections establishing)
tps = 9.139863 (excluding connections establishing)
32776   /tmp/perf/pg_xlog
altertable parallel
tps = 45.109269 (including connections establishing)
tps = 45.122066 (excluding connections establishing)
65544   /tmp/perf/pg_xlog
createfunction single
tps = 131.192719 (including connections establishing)
tps = 131.209112 (excluding connections establishing)
16392   /tmp/perf/pg_xlog
createfunction parallel
tps = 624.830173 (including connections establishing)
tps = 625.017525 (excluding connections establishing)
32776
/tmp/perf/pg_xlog

-- patch applied --

$ eatmydata /tmp/bench.sh
createtable single
tps = 329.063474 (including connections establishing)
tps = 329.104147 (excluding connections establishing)
262152  /tmp/perf/pg_xlog
createtable parallel
tps = 1462.524932 (including connections establishing)
tps = 1462.872552 (excluding connections establishing)
1130504 /tmp/perf/pg_xlog
altertable single
tps = 141.091905 (including connections establishing)
tps = 141.108352 (excluding connections establishing)
163848  /tmp/perf/pg_xlog
altertable parallel
tps = 812.810544 (including connections establishing)
tps = 813.026734 (excluding connections establishing)
901128  /tmp/perf/pg_xlog
createfunction single
tps = 3384.068190 (including connections establishing)
tps = 3384.460953 (excluding connections establishing)
81928   /tmp/perf/pg_xlog
createfunction parallel
tps = 20744.363972 (including connections establishing)
tps = 20750.135978 (excluding connections establishing)
458760  /tmp/perf/pg_xlog

$ /tmp/bench.sh
createtable single
tps = 42.522505 (including connections establishing)
tps = 42.527233 (excluding connections establishing)
49160   /tmp/perf/pg_xlog
createtable parallel
tps = 148.753762 (including connections establishing)
tps = 148.794361 (excluding connections establishing)
131080  /tmp/perf/pg_xlog
altertable single
tps = 9.524616 (including connections establishing)
tps = 9.525757 (excluding connections establishing)
32776   /tmp/perf/pg_xlog
altertable parallel
tps = 49.209278 (including connections establishing)
tps = 49.223312 (excluding connections establishing)
65544   /tmp/perf/pg_xlog
createfunction single
tps = 132.325526 (including connections establishing)
tps = 132.340677 (excluding connections establishing)
16392   /tmp/perf/pg_xlog
createfunction parallel
Re: [HACKERS] logical changeset generation v6.2
On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund and...@2ndquadrant.com wrote: Hi Robert, On 2013-10-09 14:49:46 -0400, Robert Haas wrote: I spent some time looking at the sample plugin (patch 9/12). Here are some review comments: - I think that the decoding plugin interface should work more like the foreign data wrapper interface. Instead of using pg_dlsym to look up fixed names, I think there should be a struct of function pointers that gets filled in and registered somehow. You mean something like CREATE OUTPUT PLUGIN registering a function with an INTERNAL return value returning a filled struct? I thought about that, but it seemed more complex. Happy to change it though if it's preferred. I don't see any need for SQL syntax. I was just thinking that the _PG_init function could fill in a structure and then call RegisterLogicalReplicationOutputPlugin(mystruct). - Still wondering how we'll use this from a bgworker. Simplified code to consume data: Cool. As long as that use case is supported, I'm happy; I just want to make sure we're not presuming that there must be an external client. - The output format doesn't look very machine-parseable. I really think we ought to provide something that is. Maybe a CSV-like format, or maybe something else, but I don't see why someone who wants to do change logging should be forced to write and install C code. If something like Bucardo can run on an unmodified system and extract change-sets this way without needing a .so file, that's going to be a huge win for usability. We can change the current format but I really see little to no chance of agreeing on a replication format that's serviceable to several solutions short term. Once we've gained some experience - maybe even this cycle - that might be different. I don't see why you're so pessimistic about that. I know you haven't worked it out yet, but what makes this harder than sitting down and designing something? 
More generally on this patch set, if I'm going to be committing any of this, I'd prefer to start with what is currently patches 3 and 4, once we reach agreement on those. Sounds like a reasonable start. Perhaps you could reshuffle the order of the series, if it's not too much work. Are we hoping to get any of this committed for this CF? If so, let's make a plan to get that done; time is short. If not, let's update the CF app accordingly. I'd really like to do so. I am travelling atm, but I will be back tomorrow evening and will push an updated patch this weekend. The issue I know of in the latest patches at http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de is renaming from http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de I'm a bit nervous about the way the combo CID logging works. I would have thought that you would emit one record per combo CID, but what you're apparently doing is emitting one record per heap tuple that uses a combo CID. For some reason that feels like an abuse (and maybe kinda inefficient, too). Either way, I also wonder what happens if a (logical?) checkpoint occurs between the combo CID record and the heap record to which it refers, or how you prevent that from happening. What if the combo CID record is written and the transaction aborts before writing the heap record (maybe without writing an abort record to WAL)? What are the performance implications of this additional WAL logging? What's the worst case? What's the typical case? Does it have a noticeable overhead when wal_level < logical? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
Hi, On 2013-10-11 09:08:43 -0400, Robert Haas wrote: On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-10-09 14:49:46 -0400, Robert Haas wrote: I spent some time looking at the sample plugin (patch 9/12). Here are some review comments: - I think that the decoding plugin interface should work more like the foreign data wrapper interface. Instead of using pg_dlsym to look up fixed names, I think there should be a struct of function pointers that gets filled in and registered somehow. You mean something like CREATE OUTPUT PLUGIN registering a function with an INTERNAL return value returning a filled struct? I thought about that, but it seemed more complex. Happy to change it though if it's preferred. I don't see any need for SQL syntax. I was just thinking that the _PG_init function could fill in a structure and then call RegisterLogicalReplicationOutputPlugin(mystruct). Hm. We can do that, but what'd be the advantage of that? The current model will correctly handle things like a 'shared_preload_libraries'ed output plugin, because its _PG_init() will not register it. With the handling done in _PG_init() there would be two. Being able to use the same .so for output plugin handling and some other replication solution specific stuff is imo useful. - Still wondering how we'll use this from a bgworker. Simplified code to consume data: Cool. As long as that use case is supported, I'm happy; I just want to make sure we're not presuming that there must be an external client. The included testcases are written using the SQL SRF interface, which in turn is a use case that doesn't use walsenders and such, so I hope we won't break it accidentally ;) - The output format doesn't look very machine-parseable. I really think we ought to provide something that is. Maybe a CSV-like format, or maybe something else, but I don't see why someone who wants to do change logging should be forced to write and install C code.
If something like Bucardo can run on an unmodified system and extract change-sets this way without needing a .so file, that's going to be a huge win for usability. We can change the current format but I really see little to no chance of agreeing on a replication format that's serviceable to several solutions short term. Once we've gained some experience - maybe even this cycle - that might be different. I don't see why you're so pessimistic about that. I know you haven't worked it out yet, but what makes this harder than sitting down and designing something? Because every replication solution has different requirements for the format and they will want to filter the output stream with regard to their own configuration. E.g. bucardo will want to include the transaction timestamp for conflict resolution and such. More generally on this patch set, if I'm going to be committing any of this, I'd prefer to start with what is currently patches 3 and 4, once we reach agreement on those. Sounds like a reasonable start. Perhaps you could reshuffle the order of the series, if it's not too much work. Sure, that's no problem. Do I understand correctly that you'd like wal_decoding: Add information about a table's primary key to struct RelationData wal_decoding: Add wal_level = logical and log data required for logical decoding earlier? I'd really like to do so. I am travelling atm, but I will be back tomorrow evening and will push an updated patch this weekend. The issue I know of in the latest patches at http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de is renaming from http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de I'm a bit nervous about the way the combo CID logging works. I would have thought that you would emit one record per combo CID, but what you're apparently doing is emitting one record per heap tuple that uses a combo CID. I thought and implemented that in the beginning. Unfortunately it's not enough :(.
That's probably the issue that took me longest to understand in this patch series... Combocids can only fix the case where a transaction actually has created a combocid: 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid) 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1) So, if we're decoding data that needs to look up those rows in TX1 or TX2 we both times need access to cmin and cmax, but neither transaction will have created a combocid. That can only be an issue in transactions with catalog modifications. A slightly more complex variant also requires this if combocids are involved: 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid) 2) TX1: SAVEPOINT foo; 3) TX1-2: UPDATE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = 55, cmax = 56, combo=123) new at 0/1: (xmin = 2, xmax=Invalid, cmin = 57, cmax =
Re: [HACKERS] logical changeset generation v6.2
Hi Robert, On 2013-10-09 14:49:46 -0400, Robert Haas wrote: I spent some time looking at the sample plugin (patch 9/12). Here are some review comments: - I think that the decoding plugin interface should work more like the foreign data wrapper interface. Instead of using pg_dlsym to look up fixed names, I think there should be a struct of function pointers that gets filled in and registered somehow. You mean something like CREATE OUTPUT PLUGIN registering a function with an INTERNAL return value returning a filled struct? I thought about that, but it seemed more complex. Happy to change it though if it's preferred. - pg_decode_init() only warns when it encounters an unknown option. An error seems more appropriate. Fine with me. I think I just made it a warning because I wanted to experiment with options. - Still wondering how we'll use this from a bgworker. Simplified code to consume data:

LogicalDecodingReAcquireSlot(NameStr(*name));
ctx = CreateLogicalDecodingContext(MyLogicalDecodingSlot,
                                   false /* not initial call */,
                                   MyLogicalDecodingSlot->confirmed_flush,
                                   options,
                                   logical_read_local_xlog_page,
                                   LogicalOutputPrepareWrite,
                                   LogicalOutputWrite);
...
while (true)
{
    XLogRecord *record;
    char       *errm = NULL;

    record = XLogReadRecord(ctx->reader, startptr, &errm);
    ...
    DecodeRecordIntoReorderBuffer(ctx, buf);
}
/* at the end, or better after every commit or such */
LogicalConfirmReceivedLocation(/* whatever you consumed */);
LogicalDecodingReleaseSlot();

- The output format doesn't look very machine-parseable. I really think we ought to provide something that is. Maybe a CSV-like format, or maybe something else, but I don't see why someone who wants to do change logging should be forced to write and install C code. If something like Bucardo can run on an unmodified system and extract change-sets this way without needing a .so file, that's going to be a huge win for usability.
We can change the current format but I really see little to no chance of agreeing on a replication format that's serviceable to several solutions short term. Once we've gained some experience - maybe even this cycle - that might be different. More generally on this patch set, if I'm going to be committing any of this, I'd prefer to start with what is currently patches 3 and 4, once we reach agreement on those. Sounds like a reasonable start. Are we hoping to get any of this committed for this CF? If so, let's make a plan to get that done; time is short. If not, let's update the CF app accordingly. I'd really like to do so. I am travelling atm, but I will be back tomorrow evening and will push an updated patch this weekend. The issue I know of in the latest patches at http://www.postgresql.org/message-id/20131007133232.ga15...@awork2.anarazel.de is renaming from http://www.postgresql.org/message-id/20131008194758.gb3718...@alap2.anarazel.de Do you know of anything else in the patches you're referring to? Thanks, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On Mon, Sep 30, 2013 at 6:44 PM, Andres Freund and...@2ndquadrant.com wrote: The series from friday was a bit too buggy - obviously I was too tired. So here's a new one: * fix pg_recvlogical makefile (Thanks Steve) * fix two commits not compiling properly without later changes (Thanks Kevin) * keep track of commit timestamps * fix bugs with option passing in test_logical_decoding * actually parse option values in test_decoding instead of just using the option name * don't use anonymous structs in unions. That's compiler specific (msvc and gcc) before C11 on which we can't rely. That unfortunately will break output plugins because ReorderBufferChange need to qualify old/new tuples now * improve error handling/cleanup in test_logical_decoding * some minor cleanups Patches attached, git tree updated. I spent some time looking at the sample plugin (patch 9/12). Here are some review comments: - I think that the decoding plugin interface should work more like the foreign data wrapper interface. Instead of using pg_dlsym to look up fixed names, I think there should be a struct of function pointers that gets filled in and registered somehow. - pg_decode_init() only warns when it encounters an unknown option. An error seems more appropriate. - Still wondering how we'll use this from a bgworker. - The output format doesn't look very machine-parseable. I really think we ought to provide something that is. Maybe a CSV-like format, or maybe something else, but I don't see why someone who wants to do change logging should be forced to write and install C code. If something like Bucardo can run on an unmodified system and extract change-sets this way without needing a .so file, that's going to be a huge win for usability. Other than that, I don't have too many concerns about the plugin interface. I think it provides useful flexibility and it generally seems well-designed. 
I hope in the future we'll be able to decode transactions on the fly instead of waiting until commit time, but I've resigned myself to the fact that we may not get that in version one. More generally on this patch set, if I'm going to be committing any of this, I'd prefer to start with what is currently patches 3 and 4, once we reach agreement on those. Are we hoping to get any of this committed for this CF? If so, let's make a plan to get that done; time is short. If not, let's update the CF app accordingly. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] logical changeset generation v6.2
On 10/03/2013 04:00 PM, Andres Freund wrote: Ok, there were a couple of bugs because I thought mxacts wouldn't need to be supported. So far your testcase doesn't crash the database anymore - it spews some internal errors though, so I am not sure if it's entirely fixed for you. Thanks for testing and helping! I've pushed the changes to the git tree, they aren't squashed yet and there's some further outstanding stuff, so I won't repost the series yet. Greetings, Andres Freund When I run your updated version (from friday, not what you posted today) against a more recent version of my slony changes I can get the test case to pass 2/3rds of the time. The failures are due to an issue in slon itself that I need to fix. I see lots of LOG: tx with subtxn 58836 but they seem harmless. Thanks
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-07 09:56:11 -0400, Steve Singer wrote: On 10/03/2013 04:00 PM, Andres Freund wrote: Ok, there were a couple of bugs because I thought mxacts wouldn't need to be supported. So far your testcase doesn't crash the database anymore - it spews some internal errors though, so I am not sure if it's entirely fixed for you. Thanks for testing and helping! I've pushed the changes to the git tree, they aren't squashed yet and there's some further outstanding stuff, so I won't repost the series yet. Greetings, Andres Freund When I run your updated version (from friday, not what you posted today) against a more recent version of my slony changes I can get the test case to pass 2/3rds of the time. The failures are due to an issue in slon itself that I need to fix. Cool. I see lots of LOG: tx with subtxn 58836 Yes, those are completely harmless. And should, in fact, be removed. I guess I should add the todo entry: * make a pass over all elog/ereport and make sure they have the correct log level et al. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-01 16:11:47 -0400, Steve Singer wrote: On 09/30/2013 06:44 PM, Andres Freund wrote: Hi, The series from friday was a bit too buggy - obviously I was too tired. So here's a new one: With this series I've also noticed

#2  0x007741a7 in ExceptionalCondition (
    conditionName=conditionName@entry=0x7c2908 "!(!(tuple->t_infomask & 0x1000))",
    errorType=errorType@entry=0x7acc70 "FailedAssertion",
    fileName=fileName@entry=0x91767e "tqual.c",
    lineNumber=lineNumber@entry=1608) at assert.c:54
54          abort();
#3  0x007a4432 in HeapTupleSatisfiesMVCCDuringDecoding (htup=0x10bfe48,
    snapshot=0x108b3d8, buffer=310) at tqual.c:1608
#4  0x0049d6b7 in heap_hot_search_buffer (tid=tid@entry=0x10bfe4c,
    relation=0x7fbebbcd89c0, buffer=310, snapshot=0x10bfda0,
    heapTuple=heapTuple@entry=0x10bfe48,
    all_dead=all_dead@entry=0x7fff4aa3866f "\001\370\375\v\001",
    first_call=1 '\001') at heapam.c:1756
#5  0x004a8174 in index_fetch_heap (scan=scan@entry=0x10bfdf8) at indexam.c:539
#6  0x004a82a8 in index_getnext (scan=0x10bfdf8,
    direction=direction@entry=ForwardScanDirection) at indexam.c:622
#7  0x004a6fa9 in systable_getnext (sysscan=sysscan@entry=0x10bfd48) at genam.c:343
#8  0x0076df40 in RelidByRelfilenode (reltablespace=0, relfilenode=529775)
    at relfilenodemap.c:214
---Type <return> to continue, or q <return> to quit---
#9  0x00664ad7 in ReorderBufferCommit (rb=0x1082d98, xid=<optimized out>,
    commit_lsn=4638756800, end_lsn=<optimized out>,
    commit_time=commit_time@entry=433970378426176) at reorderbuffer.c:1320

Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog tables? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 10/03/2013 12:38 PM, Andres Freund wrote: Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog tables? Greetings, Andres Freund Yes. It declares sl_table and sl_sequence and sl_set as catalog. It does a SELECT .. from @NAMESPACE@.sl_table T, @NAMESPACE@.sl_set S, pg_catalog.pg_class PGC, pg_catalog.pg_namespace PGN, pg_catalog.pg_index PGX, pg_catalog.pg_class PGXC where ... for update in the code being executed by the 'set add table'. (We also do select for update commands in many other places during cluster configuration commands)
Re: [HACKERS] logical changeset generation v6.2
On 2013-10-03 13:03:07 -0400, Steve Singer wrote: On 10/03/2013 12:38 PM, Andres Freund wrote: Does your code use SELECT FOR UPDATE/SHARE on system or treat_as_catalog tables? Greetings, Andres Freund Yes. It declares sl_table and sl_sequence and sl_set as catalog. It does a SELECT .. from @NAMESPACE@.sl_table T, @NAMESPACE@.sl_set S, pg_catalog.pg_class PGC, pg_catalog.pg_namespace PGN, pg_catalog.pg_index PGX, pg_catalog.pg_class PGXC where ... for update in the code being executed by the 'set add table'. (We also do select for update commands in many other places during cluster configuration commands) Ok, there were a couple of bugs because I thought mxacts wouldn't need to be supported. So far your testcase doesn't crash the database anymore - it spews some internal errors though, so I am not sure if it's entirely fixed for you. Thanks for testing and helping! I've pushed the changes to the git tree, they aren't squashed yet and there's some further outstanding stuff, so I won't repost the series yet. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] logical changeset generation v6.2
On 09/30/2013 06:44 PM, Andres Freund wrote: Hi, The series from friday was a bit too buggy - obviously I was too tired. So here's a new one: With this series I've also noticed

#2  0x007741a7 in ExceptionalCondition (
    conditionName=conditionName@entry=0x7c2908 "!(!(tuple->t_infomask & 0x1000))",
    errorType=errorType@entry=0x7acc70 "FailedAssertion",
    fileName=fileName@entry=0x91767e "tqual.c",
    lineNumber=lineNumber@entry=1608) at assert.c:54
54          abort();
#3  0x007a4432 in HeapTupleSatisfiesMVCCDuringDecoding (htup=0x10bfe48,
    snapshot=0x108b3d8, buffer=310) at tqual.c:1608
#4  0x0049d6b7 in heap_hot_search_buffer (tid=tid@entry=0x10bfe4c,
    relation=0x7fbebbcd89c0, buffer=310, snapshot=0x10bfda0,
    heapTuple=heapTuple@entry=0x10bfe48,
    all_dead=all_dead@entry=0x7fff4aa3866f "\001\370\375\v\001",
    first_call=1 '\001') at heapam.c:1756
#5  0x004a8174 in index_fetch_heap (scan=scan@entry=0x10bfdf8) at indexam.c:539
#6  0x004a82a8 in index_getnext (scan=0x10bfdf8,
    direction=direction@entry=ForwardScanDirection) at indexam.c:622
#7  0x004a6fa9 in systable_getnext (sysscan=sysscan@entry=0x10bfd48) at genam.c:343
#8  0x0076df40 in RelidByRelfilenode (reltablespace=0, relfilenode=529775)
    at relfilenodemap.c:214
---Type <return> to continue, or q <return> to quit---
#9  0x00664ad7 in ReorderBufferCommit (rb=0x1082d98, xid=<optimized out>,
    commit_lsn=4638756800, end_lsn=<optimized out>,
    commit_time=commit_time@entry=433970378426176) at reorderbuffer.c:1320

In addition to some of the other ones I've posted about. * fix pg_recvlogical makefile (Thanks Steve) * fix two commits not compiling properly without later changes (Thanks Kevin) * keep track of commit timestamps * fix bugs with option passing in test_logical_decoding * actually parse option values in test_decoding instead of just using the option name * don't use anonymous structs in unions. That's compiler specific (msvc and gcc) before C11 on which we can't rely.
That unfortunately will break output plugins because ReorderBufferChange need to qualify old/new tuples now * improve error handling/cleanup in test_logical_decoding * some minor cleanups Patches attached, git tree updated. Greetings, Andres Freund