Hi, It took me far longer than I planned, its not finished, but time is running out. I would like some feedback that I am not going astray at this point... *I* think the general approach is sound and a good way forward that provides the basic infrastructure for many (all?) of the scenarios we talked about before.
Anyway, here is my next attempt at $TOPIC. Lets start with a quick demo (via psql): /* just so we keep a sensible xmin horizon */ ROLLBACK PREPARED 'f'; BEGIN; CREATE TABLE keepalive(); PREPARE TRANSACTION 'f'; DROP TABLE IF EXISTS replication_example; SELECT pg_current_xlog_insert_location(); CHECKPOINT; CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); begin; INSERT INTO replication_example(somedata, text) VALUES (1, 1); INSERT INTO replication_example(somedata, text) VALUES (1, 2); commit; ALTER TABLE replication_example ADD COLUMN bar int; INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4); BEGIN; INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4); INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4); INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL); commit; ALTER TABLE replication_example DROP COLUMN bar; INSERT INTO replication_example(somedata, text) VALUES (3, 1); BEGIN; INSERT INTO replication_example(somedata, text) VALUES (3, 2); INSERT INTO replication_example(somedata, text) VALUES (3, 3); commit; ALTER TABLE replication_example RENAME COLUMN text TO somenum; INSERT INTO replication_example(somedata, somenum) VALUES (4, 1); ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4); INSERT INTO replication_example(somedata, somenum) VALUES (5, 1); SELECT pg_current_xlog_insert_location(); ---- Somewhat later ---- SELECT decode_xlog('0/1893D78', '0/18BE398'); WARNING: BEGIN WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1 WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2 WARNING: COMMIT WARNING: BEGIN WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4 WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4 WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4 WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]: (null) WARNING: COMMIT WARNING: BEGIN WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1 WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2 WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3 WARNING: COMMIT WARNING: BEGIN WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1 WARNING: COMMIT WARNING: BEGIN WARNING: COMMIT WARNING: BEGIN WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1 WARNING: COMMIT decode_xlog ------------- t (1 row) As you can see the patchset can decode several changes made to a table even though we used DDL on it. Not everything is handled yet, but its a prototype after all ;) The way this works is: A new component called SnapshotBuilder analyzes the xlog and build a special kind of Snapshot. This works in a somewhat similar way to the KnownAssignedXids machinery for Hot Standby. Whenever the - mostly unchanged - ApplyCache calls a 'apply_change' callback for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal SnapshotNow semantics used for catalog access with one of the previously built snapshots. They should behave just the same as a normal SnapshotNow would have behaved when the tuple change was written to the xlog. This patch doesn't provide anything that uses the new infrastructure for anything real, but I think thats good. Lets get this into something committable and then add new things using it! Small overview over the individual patches that will come as separate mails: old, Alvaro is doing this properly right now, separate thread [01] Add embedded list interface (header only) A new piece of infrastructure (for k-way mergesort), pretty much untested, good idea in general I think, not very interesting: [02] Add minimal binary heap implementation Boring, old.: [03] Add support for a generic wal reading facility dubbed XLogReader Boring, old, borked: [04] add simple xlogdump tool Slightly changed to use (tablespace, relfilenode), possibly similar problems to earlier, not interesting at this point. [05] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode) Unchanged: [06] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical I didn't implement proper cache handling, so I need to use the big hammer...: [07] Make InvalidateSystemCaches public The major piece: [08] Introduce wal decoding via catalog timetravel [08] has loads of defficiencies. To cite the commit: The snapshot building has the most critical infrastructure but misses several important features: * loads of docs about the internals * improve snapshot building/distributions * don't build them all the time, cache them * don't increase ->xmax so slowly, its inefficient * refcount * actually free them * proper cache handling * we can probably reuse xl_xact_commit->nmsgs * generate new local inval messages from catalog changes? * handle transactions with both ddl, and changes * command_id handling * combocid loggin/handling * Add support for declaring tables as catalog tables that are not pg_catalog.* * properly distribute new SnapshotNow snapshots after a transaction commits * loads of testing/edge cases * provision of a consistent snapshot for pg_dump * spill state to disk at checkpoints * xmin handling The decode_xlog() function is *purely* a debugging tool that I do not want to keep in the long run. I introduced it so we can concentrate on the topic at hand without involving even more moving parts (see the next paragraph)... Some parts of this I would like to only discuss later, in separate threads, to avoid cluttering this one more than neccessary: * how do we integrate this into walsender et al * in which format do we transport changes * how do we always keep enough wal I have some work ontop of this, that handles ComboCid's and CommandId's correctly (and thus mixed ddl/dml transactions), but its simply not finished enough. I am pretty sure by now that it works even with those additional complexities. So, I am unfortunately too tired to write more than this... It will have to suffice. I plan to release a newer version with more documentation soon. Comments about the approach or even the general direction of the implementation? Questions? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers