Re: [HACKERS] Why copy_relation_data only use walwhenWALarchivingis enabled

2007-10-17 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote:
 Simon Riggs wrote:
 On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
 Simon Riggs wrote:
 If you've got a better problem statement it would be good to get that
 right first before we discuss solutions.
 Reusing a relfilenode of a deleted relation, before next checkpoint
 following the commit of the deleting transaction, for an operation that
 doesn't WAL log the contents of the new relation, leads to data loss on
 recovery.
 OK, thanks. 

 I wasn't aware we reused refilenode ids. The code in GetNewOid() doesn't
 look deterministic to me, or at least isn't meant to be.
 GetNewObjectId() should be cycling around, so although the oid index
 scan using SnapshotDirty won't see committed deleted rows that shouldn't
 matter for 2^32 oids. So what gives?
 I don't think you still quite understand what's happening. 
 
 Clearly. It's not a problem to admit that.
 
 GetNewOid()
 is not interesting here, look at GetNewRelFileNode() instead. And
 neither are snapshots or MVCC visibility rules.
 
 Which calls GetNewOid() in all cases, AFAICS.
 
 How does the reuse you say is happening come about? Seems like the bug
 is in the reuse, not in how we cope with potential reuse.

After a table is dropped, the dropping transaction has been committed,
and the relation file has been deleted, there's nothing preventing the
reuse. There's no trace of that relfilenode in the system (except in the
WAL, which we never look into except on WAL replay). There's a dead row
in pg_class with that relfilenode, but even that could be vacuumed away
(not that it matters because we don't examine that).

Now the problem is that there's a record in the WAL to delete a relation
file with that relfilenode. If that relfilenode was reused, we delete
the contents of the new relation file when we replay that WAL record.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Why copy_relation_data only use walwhenWALarchivingis enabled

2007-10-17 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote:
 
 The test script you
 showed cheats six-ways-from-Sunday to cause an OID collision that would
 never happen in practice.  The only case where it would really happen
 is if a table that has existed for a long time (~ 2^32 OID creations)
 gets dropped and then you're unlucky enough to recycle that exact OID
 before the next checkpoint --- and then crash before the checkpoint.
 Yeah, it's unlikely to happen, but the consequences are horrible.
 
 When is this going to happen?
 
 We'd need to insert 2^32 toast chunks, which is 4 TB of data, or insert
 2^32 large objects, or create 2^32 tables, or any combination of the
 above all within one checkpoint duration *and* exactly hit the exact
 same relation.

The consumption of the OIDs don't need to happen within one checkpoint
duration. As long as the DROP and the reuse happens in the same
checkpoint cycle, you're screwed.

Granted that you're not likely to ever experience OID wrap-around unless
you have a heavily used user table with OIDs. Or create/drop temp tables
a lot.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org