I agree that we aren't MVCC with respect to DDL operations (and for this purpose CLUSTER is DDL). Trying to become so would open a can of worms far larger than it's worth, though, IMHO.
I think if we can come up with a reasonable way to handle all the consequences, it's worth doing. And yes, I realize there are a lot of consequences, so it may well not be possible.
Transaction 1 Transaction 2
BEGIN;
... BEGIN;
... INSERT INTO a;
CLUSTER a; ...
Currently, because T2 continues to hold a write lock on A until it commits, T1's CLUSTER will block until T2 commits; therefore CLUSTER's use of SnapshotNow is sufficient to not lose any live tuples. Were T1 to use a transaction-aware snapshot to scan A, this would not be so.
I think this is somewhat tangential: we're discussing changing the snapshot used to scan system catalogs, not user relations like A. The only reason that CLUSTER's use of SnapshotNow is a problem at the moment is the same reason that TRUNCATE is a problem -- a concurrent serializable transaction will use the new relfilenode, not the old one.
Transaction 1 Transaction 2
BEGIN;
... CLUSTER a;
INSERT INTO a;
Were T1 using a transaction-aware snapshot to read pg_class, it would insert its new tuple into the wrong relfilenode for A, causing either immediate failure or eventual loss of a live tuple.
Yes, definitely a problem :( The same applies to TRUNCATE, naturally. The only somewhat reasonable behavior I can think of is to cause modifications to the oldrelfilenode to fail in a concurrent serializable transaction. The behavior would be somewhat analogous to an UPDATE in a serializable transaction failing because of a concurrent data modification, although in this case we would error out on any modification (e.g. INSERT).
-Neil
---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])