On Mon, Jun 25, 2012 at 3:17 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> I suppose the main reason we haven't done it already is that it
>> increases the period of time during which we're using 2X the disk
>> space.
> I find that an acceptable price if it's optional. Making it such doesn't
> seem to be a problem for me.
+1.

>> I think there is absolutely nothing wrong with doing extra things in
>> ALTER TABLE when logical replication is enabled. We've got code
>> that's conditional on Hot Standby being enabled in many places in the
>> system; why should logical replication be any different? If we set
>> the bar for logical replication at "the system can't do anything
>> differently when logical replication is enabled" then I cheerfully
>> submit that we are doomed. You've already made WAL format changes to
>> support logging the pre-image of the tuple, which is a hundred times
>> more likely to cause a performance problem than any monkeying around
>> we might want to do in ALTER TABLE.
>>
>> I am deeply skeptical that we need to look inside of transactions that
>> do full-table rewrites. But even if we do, I don't see that what I'm
>> proposing precludes it. For example, I think we could have ALTER
>> TABLE emit WAL records specifically for logical replication that allow
>> us to disentangle which tuple descriptor to use at which point in the
>> transaction. I don't see that that would even be very difficult to
>> set up.
> Sorry, I was imprecise above: I have no problem doing some changes during
> ALTER TABLE if logical rep is enabled. I am worried though that to make
> that robust you would need loads of places that emit additional
> information:
> * ALTER TABLE
> * ALTER FUNCTION
> * ALTER OPERATOR
> * ALTER/CREATE CAST
> * TRUNCATE
> * CLUSTER
> * ...
>
> I have the feeling that, if we want to do that, the amount of required
> information would be rather high and would end up being essentially the
> catalog. Just having an intermediate tupledesc doesn't help that much if
> you have e.g. record_out doing type lookups of its own.
>
> There also is the issue you have talked about before, that a user type
> might depend on values in other tables. Unless we're ready to break at
> least transactional behaviour for those (for now...), I don't see how
> decoding outside of the transaction is ever going to be valid? I wouldn't
> have a big problem declaring that as broken for now...

I've been thinking about this a lot.  My thinking's still evolving
somewhat, but consider the following case.  A user defines a type T with
an input function I and an output function O.  They create a table which
uses type T and insert a bunch of data, which is duly parsed using I;
then, they replace I with a new input function I' and O with a new output
function O'.  Now, clearly, if we process the inserts using the catalogs
that were in effect at the time the inserts were done, we could
theoretically get different output than if we use the catalogs that were
in effect after the I/O functions were replaced.  But is the latter
output wrong, or merely different?  My first thought when we started
talking about this was "it's wrong", but the more I think about it, the
less convinced I am...

...because it can't possibly be right to suppose that it's impossible to
decode heap tuples using any catalog contents other than the ones that
were in effect at the time the tuples got inserted.  If that were true,
then we wouldn't be able to read a table after adding or dropping a
column, which of course we can.  It seems to me that it might be
sufficient to guarantee that we'll decode using the same *types* that
were in effect at the time the inserts happened.  If the user yanks the
rug out from under us by changing the type definition, maybe we simply
define that as a situation in which they get to keep both pieces.
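For concreteness, here is a minimal sketch of that scenario.  Everything
in it is illustrative: the names are made up, and the type borrows the
built-in textin/textout as its original I and O (defining LANGUAGE
internal functions requires superuser).

-- Shell type first, so the I/O functions can reference it.
CREATE TYPE mytype;

-- The original I and O, borrowed from the built-in text type.
CREATE FUNCTION mytype_in(cstring) RETURNS mytype
    AS 'textin' LANGUAGE internal IMMUTABLE STRICT;
CREATE FUNCTION mytype_out(mytype) RETURNS cstring
    AS 'textout' LANGUAGE internal IMMUTABLE STRICT;

CREATE TYPE mytype (INPUT = mytype_in, OUTPUT = mytype_out,
                    INTERNALLENGTH = VARIABLE);

CREATE TABLE demo (v mytype);
INSERT INTO demo VALUES ('hello');   -- parsed with the original I

-- Now swap in O'.  Both text and bytea are varlenas, so byteaout can
-- (ab)use the same on-disk representation.
CREATE OR REPLACE FUNCTION mytype_out(mytype) RETURNS cstring
    AS 'byteaout' LANGUAGE internal IMMUTABLE STRICT;

SELECT * FROM demo;   -- with bytea_output = 'hex', the old row now
                      -- prints as \x68656c6c6f

Once O' is in place, the same on-disk tuple decodes to different text
than the user originally typed - different, but arguably not wrong.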
After all, if you replace the type definition in a way that makes
sensible decoding of the table impossible, you've pretty much shot
yourself in the foot whether logical replication enters the picture or
not.  In the enum case, for example, we go to great pains to make sure
that the table contents are always decodable not only under the current
version of SnapshotNow, but also any successor version.  We do that by
prohibiting ALTER TYPE .. ADD VALUE from running inside a transaction
block - because if we inserted a row into pg_enum and then inserted
dependent rows into some user table, a rollback could leave us with rows
that we can't decode.  For the same reason, we don't allow ALTER TYPE ..
DROP VALUE.  (A session sketch of the ADD VALUE restriction appears at
the end of this mail.)  I think that we can infer a general principle
from this: while I/O functions may refer to catalog contents, they may
not do so in a way that could be invalidated by subsequent commits or
rollbacks.  If they did, they'd be breaking the ability of subsequent
SELECT statements to read the table.

An interesting case, and arguably an exception to this rule, is the
regwhatever types, which will cheerfully output their value as an OID if
it can't be decoded to text, but will bail on input if the textual input
can't be mapped to an OID via the appropriate system catalog.  The
results don't even obey transaction semantics:

rhaas=# create table t (a int);
CREATE TABLE
rhaas=# begin;
BEGIN
rhaas=# select 't'::regclass;
 regclass
----------
 t
(1 row)

-- at this point, drop table t in another session

rhaas=# select 't'::regclass;
 regclass
----------
 t
(1 row)

But:

rhaas=# create table t (a int);
CREATE TABLE
rhaas=# begin;
BEGIN

-- at this point, drop table t in another session

rhaas=# select 't'::regclass;
ERROR:  relation "t" does not exist at character 8
STATEMENT:  select 't'::regclass;
ERROR:  relation "t" does not exist
LINE 1: select 't'::regclass;
               ^

So in this case, the I/O functions are dependent not only on the contents
of the system catalogs and what's visible under SnapshotNow, but also on
the contents of the backend-local system caches and the exact timing of
calls to AcceptInvalidationMessages() with respect to concurrent
activities in other sessions.  There is no way you're gonna be able to
figure that out from the WAL stream, which means that there's no possible
way to ensure that decode-to-text produces the same value that the user
actually typed (never mind that he could have specified the input value
as either an integer or a table name).  And even if you could, the
replication transaction could fail during reencoding on the remote node.
To replicate this at all, you'd probably have to decide on replicating
the underlying OID and forget about the text representation... but it
wouldn't bother me very much to say "oh, that's a kooky internal type,
you can't use it with logical replication".  Unless there's some way of
making sure that all the remote OID assignments are the same as the local
ones, it's not going to be that meaningful anyway.  Even if you had such
a thing, the fact that the behavior can depend on leftover syscache
contents means that such a type might break under parallel apply.

So to get back to my main point: how about decreeing that logical rep's
responsibilities extend only to using the correct set of type OIDs to
decode the data, placing the burden of not whacking around the type
definitions or any ancillary tables on the user?  That seems to reduce
the complexity of the problem scenarios quite a lot, and also seems
generally sane.
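Here is the session sketch of the ALTER TYPE .. ADD VALUE restriction
promised above.  The session is hypothetical, the type name is made up,
and the exact error wording is from memory:

rhaas=# create type mood as enum ('sad', 'ok');
CREATE TYPE
rhaas=# begin;
BEGIN
rhaas=# alter type mood add value 'happy';
ERROR:  ALTER TYPE ... ADD cannot run inside a transaction block

That prohibition exists because, as noted above, a rollback after
dependent rows had been inserted could leave us with rows we can't
decode.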
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company