Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Jacky Leng

 Heikki Linnakangas [EMAIL PROTECTED] writes:
 I tend to agree that truncating the file, and extending the fsync
 request mechanism to actually delete it after the next checkpoint,
 is the most reasonable route to a fix.


How about just allowing to use wal even WAL archiving is disabled?
It seems that recovery of XLOG_HEAP_NEWPAGE record will do the
right thing for us, look at heap_xlog_newpage: XLogReadBuffer
with init=true will extend the block rightly and rebuild it rightly.

Someone may say that it's not worth recording xlog for operations
such as copy_relation_data, but these operations shouldn't happen
frequently. 



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Heikki Linnakangas
Jacky Leng wrote:
 I tend to agree that truncating the file, and extending the fsync
 request mechanism to actually delete it after the next checkpoint,
 is the most reasonable route to a fix.
 
 How about just allowing to use wal even WAL archiving is disabled?
 It seems that recovery of XLOG_HEAP_NEWPAGE record will do the
 right thing for us, look at heap_xlog_newpage: XLogReadBuffer
 with init=true will extend the block rightly and rebuild it rightly.
 
 Someone may say that it's not worth recording xlog for operations
 such as copy_relation_data, but these operations shouldn't happen
 frequently. 

Always using WAL would fix the problem, but it's a big performance hit.
WAL-logging doubles the amount of write I/O required.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Heikki Linnakangas
Heikki Linnakangas wrote:
 Tom Lane wrote:
 I tend to agree that truncating the file, and extending the fsync
 request mechanism to actually delete it after the next checkpoint,
 is the most reasonable route to a fix.
 
 Ok, I'll write a patch to do that.

There's a small problem with that: DROP TABLESPACE checks that the
tablespace directory is empty, and fails if it sees one of those empty
files. You also run into that problem if you

1. BEGIN; CREATE TABLE; -- no commit
2. crash+restart
3. DROP TABLESPACE

because we leave behind the stale file created by CREATE TABLE.

The best I can think of is to rename the obsolete file to
relfilenode.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.

That still won't fix the problem with files created by a crashed
transaction. For that we had a plan a long time ago: after recovery,
scan the data directory for any files don't have a live row in pg_class,
and write a message to log for each so that the DBA can delete them
(deleting them automatically was considered too dangerous). That's
probably 8.4 material, though.

Thoughts?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 The best I can think of is to rename the obsolete file to
 relfilenode.stale, when it's scheduled for deletion at next
 checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
 and delete them immediately in DropTableSpace.

This is getting too Rube Goldbergian for my tastes.  What if we just
make DROP TABLESPACE force a checkpoint before proceeding?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Heikki Linnakangas
Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
 The best I can think of is to rename the obsolete file to
 relfilenode.stale, when it's scheduled for deletion at next
 checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
 and delete them immediately in DropTableSpace.
 
 This is getting too Rube Goldbergian for my tastes.  What if we just
 make DROP TABLESPACE force a checkpoint before proceeding?

True, that would work. DROP TABLESPACE should be uncommon enough that
the performance hit is ok. We only need to checkpoint if the directory
isn't empty, though I think that's the case more often than not; you're
most likely to drop a tablespace right after dropping all relations in it.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Tom Lane wrote:

I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.


Ok, I'll write a patch to do that.


What is the argument against making relfilenodes globally unique by adding the 
xid and epoch of the creating transaction to the filename? Those 64 bits could 
be stuffed into 13 bytes by base-36 encoding (A-Z,0-9). The maximum length of a 
relfilenode would then be 10 + 1 + 13 = 24, which any reasonable filesystem 
should support IMHO.


regards, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Tom Lane wrote:

I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.


Ok, I'll write a patch to do that.


What is the argument against making relfilenodes globally unique by adding the 
xid and epoch of the creating transaction to the filename? Those 64 bits could 
be stuffed into 13 bytes by base-36 encoding (A-Z,0-9). The maximum length of a 
relfilenode would then be 10 + 1 + 13 = 24, which any reasonable filesystem 
should support IMHO.


regards, Florian Pflug

PS: Sorry if this arrives twice - I'm having a few troubles with my mail setup.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Heikki Linnakangas
Florian G. Pflug wrote:
 Heikki Linnakangas wrote:
 Tom Lane wrote:
 I tend to agree that truncating the file, and extending the fsync
 request mechanism to actually delete it after the next checkpoint,
 is the most reasonable route to a fix.

 Ok, I'll write a patch to do that.
 
 What is the argument against making relfilenodes globally unique by
 adding the xid and epoch of the creating transaction to the filename?
 Those 64 bits could be stuffed into 13 bytes by base-36 encoding
 (A-Z,0-9). The maximum length of a relfilenode would then be 10 + 1 + 13
 = 24, which any reasonable filesystem should support IMHO.

The size of would be xid + epoch + oid = 96 bits, not 64 bits.

That would work, but sounds like a much bigger change.

sizeof(RelFileNode) would increase from 12 to 20, so any data structure
that deals with RelFileNodes would take more memory. Hash function in
buffer manager would get more expensive. I remember seeing that showing
up in oprofile sometimes, but it'd need to be benchmarked to see if it
really matters.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Tom Lane
Florian G. Pflug [EMAIL PROTECTED] writes:
 What is the argument against making relfilenodes globally unique by adding 
 the 
 xid and epoch of the creating transaction to the filename?

1. Zero chance of ever backpatching.  (I know I said I wasn't excited
   about that, but it's still a strike against a proposed fix.)

2. Adds new fields to RelFileNode, which will be a major code change,
   and possibly a noticeable performance hit (bigger hashtable keys).

3. Adds new columns to pg_class, which is a real PITA ...

4. Breaks oid2name and all similar code that knows about relfilenode.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-18 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

What is the argument against making relfilenodes globally unique by adding
the xid and epoch of the creating transaction to the filename?


1. Zero chance of ever backpatching.  (I know I said I wasn't excited about
that, but it's still a strike against a proposed fix.)

2. Adds new fields to RelFileNode, which will be a major code change, and
possibly a noticeable performance hit (bigger hashtable keys).

3. Adds new columns to pg_class, which is a real PITA ...

4. Breaks oid2name and all similar code that knows about relfilenode.


Ah, Ok. I was under the impression that relfilenode in pg_class is a string of
some kind. In that case only GetNewRelFileNode would have needed patching...
But that is obviously not the case, as I realized now :-(

Thanks for setting me straight ;-)

regards, Florian Pflug

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-17 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
 Simon Riggs wrote:
 If you've got a better problem statement it would be good to get that
 right first before we discuss solutions.
 Reusing a relfilenode of a deleted relation, before next checkpoint
 following the commit of the deleting transaction, for an operation that
 doesn't WAL log the contents of the new relation, leads to data loss on
 recovery.
 
 OK, thanks. 
 
 I wasn't aware we reused refilenode ids. The code in GetNewOid() doesn't
 look deterministic to me, or at least isn't meant to be.
 GetNewObjectId() should be cycling around, so although the oid index
 scan using SnapshotDirty won't see committed deleted rows that shouldn't
 matter for 2^32 oids. So what gives?

I don't think you still quite understand what's happening. GetNewOid()
is not interesting here, look at GetNewRelFileNode() instead. And
neither are snapshots or MVCC visibility rules.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-17 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I don't think you still quite understand what's happening. GetNewOid()
 is not interesting here, look at GetNewRelFileNode() instead. And
 neither are snapshots or MVCC visibility rules.

Simon has a legitimate objection; not that there's no bug, but that the
probability of getting bitten is exceedingly small.  The test script you
showed cheats six-ways-from-Sunday to cause an OID collision that would
never happen in practice.  The only case where it would really happen
is if a table that has existed for a long time (~ 2^32 OID creations)
gets dropped and then you're unlucky enough to recycle that exact OID
before the next checkpoint --- and then crash before the checkpoint.

I think we should think about ways to fix this, but I don't feel a need
to try to backpatch a solution.

I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.

I think the objection about leaking files on crash is wrong. We'd
have the replay of the deletion to fix things up --- it could probably
delete the file immediately, and if not could certainly put it back
on the fsync request queue.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-17 Thread Heikki Linnakangas
Tom Lane wrote:
 Simon has a legitimate objection; not that there's no bug, but that the
 probability of getting bitten is exceedingly small.  

Oh, if that's what he meant, he's right.

 The test script you
 showed cheats six-ways-from-Sunday to cause an OID collision that would
 never happen in practice.  The only case where it would really happen
 is if a table that has existed for a long time (~ 2^32 OID creations)
 gets dropped and then you're unlucky enough to recycle that exact OID
 before the next checkpoint --- and then crash before the checkpoint.

Yeah, it's unlikely to happen, but the consequences are horrible.

Note that it's not just DROP TABLE that's a problem, but anything that
uses smgrscheduleunlink, including CLUSTER and REINDEX.

 I tend to agree that truncating the file, and extending the fsync
 request mechanism to actually delete it after the next checkpoint,
 is the most reasonable route to a fix.

Ok, I'll write a patch to do that.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-17 Thread Simon Riggs
On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
  Simon Riggs wrote:
  If you've got a better problem statement it would be good to get that
  right first before we discuss solutions.
  Reusing a relfilenode of a deleted relation, before next checkpoint
  following the commit of the deleting transaction, for an operation that
  doesn't WAL log the contents of the new relation, leads to data loss on
  recovery.
  
  OK, thanks. 
  
  I wasn't aware we reused refilenode ids. The code in GetNewOid() doesn't
  look deterministic to me, or at least isn't meant to be.
  GetNewObjectId() should be cycling around, so although the oid index
  scan using SnapshotDirty won't see committed deleted rows that shouldn't
  matter for 2^32 oids. So what gives?
 
 I don't think you still quite understand what's happening. 

Clearly. It's not a problem to admit that.

 GetNewOid()
 is not interesting here, look at GetNewRelFileNode() instead. And
 neither are snapshots or MVCC visibility rules.

Which calls GetNewOid() in all cases, AFAICS.

How does the reuse you say is happening come about? Seems like the bug
is in the reuse, not in how we cope with potential reuse.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Why copy_relation_data only use wal whenWALarchivingis enabled

2007-10-17 Thread Simon Riggs
On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote:

  The test script you
  showed cheats six-ways-from-Sunday to cause an OID collision that would
  never happen in practice.  The only case where it would really happen
  is if a table that has existed for a long time (~ 2^32 OID creations)
  gets dropped and then you're unlucky enough to recycle that exact OID
  before the next checkpoint --- and then crash before the checkpoint.
 
 Yeah, it's unlikely to happen, but the consequences are horrible.

When is this going to happen?

We'd need to insert 2^32 toast chunks, which is 4 TB of data, or insert
2^32 large objects, or create 2^32 tables, or any combination of the
above all within one checkpoint duration *and* exactly hit the exact
same relation.

That's a weird and huge application, a very fast server and an unlucky
DBA to hit the exact OID to be reused and then have the server crash so
we'll ever notice.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


---(end of broadcast)---
TIP 6: explain analyze is your friend