Hi,

It took me far longer than I planned, its not finished, but time is running 
out. I would like some feedback that I am not going astray at this point... 
*I* think the general approach is sound and a good way forward that provides 
the basic infrastructure for many (all?) of the scenarios we talked about 
before.

Anyway, here is my next attempt at $TOPIC.

Lets start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;

SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text 
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;

INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);

commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;

INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING 
(somenum::int4);

INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');

WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING:  tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING:  tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING:  tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING:  tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING:  COMMIT
 decode_xlog 
-------------
 t
(1 row)


As you can see the patchset can decode several changes made to a table even 
though we used DDL on it. Not everything is handled yet, but its a prototype 
after all ;)

The way this works is:

A new component called SnapshotBuilder analyzes the xlog and build a special 
kind of Snapshot. This works in a somewhat similar way to the 
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls a 'apply_change' callback 
for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal 
SnapshotNow semantics used for catalog access with one of the previously built 
snapshots. They should behave just the same as a normal SnapshotNow would have 
behaved when the tuple change was written to the xlog.

This patch doesn't provide anything that uses the new infrastructure for 
anything real, but I think thats good. Lets get this into something 
committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread
[01] Add embedded list interface (header only) 

A new piece of infrastructure (for k-way mergesort), pretty much untested, 
good idea in general I think, not very interesting:
[02] Add minimal binary heap implementation

Boring, old.:
[03] Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04] add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems 
to earlier, not interesting at this point.
[05] Add a new syscache to fetch a pg_class entry via (reltablespace, 
relfilenode)

Unchanged:
[06] Log enough data into the wal to reconstruct logical changes from it if 
wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07] Make InvalidateSystemCaches public

The major piece:
[08] Introduce wal decoding via catalog timetravel



[08] has loads of defficiencies. To cite the commit:
    The snapshot building has the most critical infrastructure but misses 
several
    important features:
    * loads of docs about the internals
    * improve snapshot building/distributions
      * don't build them all the time, cache them
      * don't increase ->xmax so slowly, its inefficient
      * refcount
      * actually free them
    * proper cache handling
      * we can probably reuse xl_xact_commit->nmsgs
      * generate new local inval messages from catalog changes?
    * handle transactions with both ddl, and changes
      * command_id handling
      * combocid loggin/handling
    * Add support for declaring tables as catalog tables that are not 
pg_catalog.*
    * properly distribute new SnapshotNow snapshots after a transaction 
commits
    * loads of testing/edge cases
    * provision of a consistent snapshot for pg_dump
    * spill state to disk at checkpoints
    * xmin handling

The decode_xlog() function is *purely* a debugging tool that I do not want to 
keep in the long run. I introduced it so we can concentrate on the topic at 
hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to 
avoid cluttering this one more than neccessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal 


I have some work ontop of this, that handles ComboCid's and CommandId's 
correctly (and thus mixed ddl/dml transactions), but its simply not finished 
enough. I am pretty sure by now that it works even with those additional 
complexities.

So, I am unfortunately too tired to write more than this... It will have to 
suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the 
implementation? Questions?

Greetings,

Andres
-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to