Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-17 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

Patch attached that does the three changes we've talked about:'
- make ForgetDatabaseFsyncRequests forget unlink requests as well
- make rmtree() not fail on ENOENT
- force checkpoint on in dropdb on all platforms


This looks fine as far as it goes


Ok, committed.


but I'm still thinking it's a bad
idea for rmtree() to fall over on the first failure.  I propose that it
ought to keep trying to delete the rest of the target directory tree.
Otherwise, one permissions problem (for example) could lead to most
of a dropped database not getting freed up,

>

If there aren't objections, I'll make that happen after Heikki applies
his patch.


Agreed. Be my guest.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-17 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Patch attached that does the three changes we've talked about:'
> - make ForgetDatabaseFsyncRequests forget unlink requests as well
> - make rmtree() not fail on ENOENT
> - force checkpoint on in dropdb on all platforms

This looks fine as far as it goes, but I'm still thinking it's a bad
idea for rmtree() to fall over on the first failure.  I propose that it
ought to keep trying to delete the rest of the target directory tree.
Otherwise, one permissions problem (for example) could lead to most
of a dropped database not getting freed up,

If there aren't objections, I'll make that happen after Heikki applies
his patch.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-17 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
causes PendingUnlinkEntrys for the doomed DB to be thrown away too.


Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
still isn't enough, I'm afraid.


Hm.  I notice that there is no bug on Windows because dropdb forces a
checkpoint before it starts to remove files.  Maybe the sanest solution
is to just do that on all platforms.


Yeah, I don't see any other simple solution.

Patch attached that does the three changes we've talked about:'
- make ForgetDatabaseFsyncRequests forget unlink requests as well
- make rmtree() not fail on ENOENT
- force checkpoint on in dropdb on all platforms

plus some comment changes reflecting what we now know. I will apply this 
tomorrow if there's no objections.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/dbcommands.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.205
diff -c -r1.205 dbcommands.c
*** src/backend/commands/dbcommands.c	26 Mar 2008 21:10:37 -	1.205
--- src/backend/commands/dbcommands.c	17 Apr 2008 18:53:17 -
***
*** 696,713 
  	pgstat_drop_database(db_id);
  
  	/*
! 	 * Tell bgwriter to forget any pending fsync requests for files in the
! 	 * database; else it'll fail at next checkpoint.
  	 */
  	ForgetDatabaseFsyncRequests(db_id);
  
  	/*
! 	 * On Windows, force a checkpoint so that the bgwriter doesn't hold any
! 	 * open files, which would cause rmdir() to fail.
  	 */
- #ifdef WIN32
  	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
- #endif
  
  	/*
  	 * Remove all tablespace subdirs belonging to the database.
--- 696,714 
  	pgstat_drop_database(db_id);
  
  	/*
! 	 * Tell bgwriter to forget any pending fsync and unlink requests for files
! 	 * in the database; else it'll fail at next checkpoint, or worse it will
! 	 * delete files that belong to a newly created database with the same OID.
  	 */
  	ForgetDatabaseFsyncRequests(db_id);
  
  	/*
! 	 * Force a checkpoint to make sure the bgwriter has received the message
! 	 * sent by ForgetDatabaseFsyncRequests. On Windows, this also ensures that
! 	 * the bgwriter doesn't hold any open files, which would cause rmdir() to
! 	 * fail.
  	 */
  	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
  
  	/*
  	 * Remove all tablespace subdirs belonging to the database.
Index: src/backend/storage/smgr/md.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.136
diff -c -r1.136 md.c
*** src/backend/storage/smgr/md.c	10 Mar 2008 20:06:27 -	1.136
--- src/backend/storage/smgr/md.c	17 Apr 2008 19:26:16 -
***
*** 1196,1203 
  		if (unlink(path) < 0)
  		{
  			/*
! 			 * ENOENT shouldn't happen either, but it doesn't really matter
! 			 * because we would've deleted it now anyway.
  			 */
  			if (errno != ENOENT)
  ereport(WARNING,
--- 1196,1206 
  		if (unlink(path) < 0)
  		{
  			/*
! 			 * There's a race condition, when the database is dropped at the
! 			 * same time that we process the pending unlink requests. If the
! 			 * DROP DATABASE deletes the file before we do, we will get ENOENT
! 			 * here. rmtree() also has to ignore ENOENT errors, to deal with
! 			 * the possibility that we delete the file first.
  			 */
  			if (errno != ENOENT)
  ereport(WARNING,
***
*** 1321,1327 
--- 1324,1334 
  		/* Remove any pending requests for the entire database */
  		HASH_SEQ_STATUS hstat;
  		PendingOperationEntry *entry;
+ 		ListCell   *cell, 
+    *prev,
+    *next;
  
+ 		/* Remove fsync requests */
  		hash_seq_init(&hstat, pendingOpsTable);
  		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  		{
***
*** 1331,1336 
--- 1338,1359 
  entry->canceled = true;
  			}
  		}
+ 	
+ 		/* Remove unlink requests */
+ 		prev = NULL;
+ 		for (cell = list_head(pendingUnlinks); cell; cell = next)
+ 		{
+ 			PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+ 
+ 			next = lnext(cell);
+ 			if (entry->rnode.dbNode == rnode.dbNode) 
+ 			{
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ 			}
+ 			else
+ prev = cell;
+ 		}
  	}
  	else if (segno == UNLINK_RELATION_REQUEST)
  	{
***
*** 1386,1392 
  }
  
  /*
!  * ForgetRelationFsyncRequests -- ensure any fsyncs for a rel are forgotten
   */
  void
  ForgetRelationFsyncRequests(RelFileNode rnode)
--- 1409,1415 
  }
  
  /*
!  * ForgetRelationFsyncRequests -- forget any fsyncs for a rel
   */
  void
  ForgetRelationFsyncRequests(Re

Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-16 Thread Andrew Dunstan



Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
  

Tom Lane wrote:


ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
  


  
Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
still isn't enough, I'm afraid.



Hm.  I notice that there is no bug on Windows because dropdb forces a
checkpoint before it starts to remove files.  Maybe the sanest solution
is to just do that on all platforms.


  


IIRC we did that to avoid an unlink problem :-)

I certainly don't see any particular reason not to do it everywhere.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-16 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
>> causes PendingUnlinkEntrys for the doomed DB to be thrown away too.

> Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
> still isn't enough, I'm afraid.

Hm.  I notice that there is no bug on Windows because dropdb forces a
checkpoint before it starts to remove files.  Maybe the sanest solution
is to just do that on all platforms.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-16 Thread Heikki Linnakangas

Tom Lane wrote:

Actually ... what if the same DB OID and relfilenode get re-made before
the checkpoint?  Then we'd be unlinking live data.  This is improbable
but hardly less so than the scenario the PendingUnlinkEntry code was
put in to prevent.

ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
causes PendingUnlinkEntrys for the doomed DB to be thrown away too.


Because of the asynchronous nature of ForgetDatabaseFsyncRequests, this 
still isn't enough, I'm afraid.


1. DROP TABLE footable;
2. Checkpoint begins.
3. DROP DATABASE foodb;
4. CREATE DATABASE bardb; -- bardb gets same OID as foodb, and a table 
copied from template database, let's call it bartable, gets same OID as 
footable
5. Checkpoint processes pending unlink for footable, but removes 
bartable instead


Needs more thought, after some sleep...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-16 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
Florian suggested a scheme where the xid and epoch is embedded in the 
filename, but that's unnecessarily complex. We could just make 
relfilenode a 64-bit integer. 2^64 should be enough for everyone.


Doesn't fix the problem unless DB and TS OIDs become int64 too;


Remember what the original scenario was:

1. DROP TABLE foo;
2. BEGIN; CREATE TABLE bar; COPY TO bar ...; COMMIT -- bar gets the same 
relfilenode as "foo", by chance

3. 
4. WAL replay of "DROP TABLE foo" deletes the data inserted at step 2. 
Because the table was created in the same transaction, we skipped WAL 
logging the inserts, and the data is lost.


Even if we can get a collision in DB or tablespace OIDs, we can't get a 
collision at step 2 if we don't reuse relfilenodes.



in fact, given that we generate relfilenodes off the OID counter,
it's difficult to see how you do this without making *all* OIDs
64-bit.


By generating them off a new 64-bit counter instead of the OID counter. 
Or by extending the OID counter to 64-bits, but only using the lower 64 
bits for OIDs.



Plus you're assuming that the machine has working 64-bit ints.
There's a large difference in my mind between saying "bigint
doesn't work right if you don't have working int64" and "we don't
guarantee the safety of your data if you don't have int64".
I'm not prepared to rip out the non-collision code until such
time as we irrevocably refuse to build on machines without int64.


Hmm. We could make it a struct of two int4s if we have to, couldn't we?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-16 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Florian suggested a scheme where the xid and epoch is embedded in the 
> filename, but that's unnecessarily complex. We could just make 
> relfilenode a 64-bit integer. 2^64 should be enough for everyone.

Doesn't fix the problem unless DB and TS OIDs become int64 too;
in fact, given that we generate relfilenodes off the OID counter,
it's difficult to see how you do this without making *all* OIDs
64-bit.

Plus you're assuming that the machine has working 64-bit ints.
There's a large difference in my mind between saying "bigint
doesn't work right if you don't have working int64" and "we don't
guarantee the safety of your data if you don't have int64".
I'm not prepared to rip out the non-collision code until such
time as we irrevocably refuse to build on machines without int64.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-16 Thread Heikki Linnakangas

Tom Lane wrote:

ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
This should prevent the unlink-live-data scenario, I think.
Even then, concurrent deletion attempts are probably possible (since
ForgetDatabaseFsyncRequests is asynchronous) and rmtree() is being far
too fragile about dealing with them.  I think that it should be coded
to ignore ENOENT the same as the bgwriter does, and that it should press
on and keep trying to delete things even if it gets a failure.


Yep. I can write a patch for that, unless you're onto it already?

However, this makes me reconsider Florian's suggestion to just make
relfilenode larger and avoid reusing them altogether. It would simplify
the code quite a bit, and make it more robust. That is good because even 
if we fix these problems per your suggestion, I'm left wondering if 
we've missed some even weirder corner-cases.


Florian suggested a scheme where the xid and epoch is embedded in the 
filename, but that's unnecessarily complex. We could just make 
relfilenode a 64-bit integer. 2^64 should be enough for everyone.


You listed these problems with Florian's suggestion back then:


1. Zero chance of ever backpatching.  (I know I said I wasn't excited
   about that, but it's still a strike against a proposed fix.)


Still true. We would need to do what you suggested for 8.3, but 
simplifying the code would be good thing in the long run.



2. Adds new fields to RelFileNode, which will be a major code change,
   and possibly a noticeable performance hit (bigger hashtable keys).


We talked about this wrt. map forks, and concluded that it's not an 
issue. If we add the map forks as well, BufferTag struct would grow from 
 16 bytes to 24 bytes. It's worth doing some more micro-benchmarking 
with that, but it's probably acceptable. Or we could allocate a few bits 
of the 64-bit relfilenode field in RelFileNode to indicate the map fork.



3. Adds new columns to pg_class, which is a real PITA ...


We would only have to change relfilenode from oid to int64.


4. Breaks oid2name and all similar code that knows about relfilenode.


True, but they're not hard to fix.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-15 Thread Andrew Dunstan



Tom Lane wrote:

Alvaro Herrera <[EMAIL PROTECTED]> writes:
  

Tom Lane wrote:


I think that it should be coded
to ignore ENOENT the same as the bgwriter does, and that it should press
on and keep trying to delete things even if it gets a failure.
  


  

Perhaps, if it gets ENOENT, record this fact -- and after rmtree
returns, retry the whole thing.



Er, retry what?  There was, presumably, nothing to delete.


  


Yeah. I agree rmtree() should treat ENOENT as non-fatal.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-15 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> I think that it should be coded
>> to ignore ENOENT the same as the bgwriter does, and that it should press
>> on and keep trying to delete things even if it gets a failure.

> Perhaps, if it gets ENOENT, record this fact -- and after rmtree
> returns, retry the whole thing.

Er, retry what?  There was, presumably, nothing to delete.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-15 Thread Alvaro Herrera
Tom Lane wrote:

> I think that it should be coded
> to ignore ENOENT the same as the bgwriter does, and that it should press
> on and keep trying to delete things even if it gets a failure.

Perhaps, if it gets ENOENT, record this fact -- and after rmtree
returns, retry the whole thing.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] DROP DATABASE vs patch to not remove files right away

2008-04-15 Thread Tom Lane
Over the last couple days I twice saw complaints like this during
DROP DATABASE:

WARNING:  could not remove file or directory "base/80750/80825": No such file 
or directory
WARNING:  could not remove database directory "base/80750"

I poked at it for awhile and was eventually able to extract a
repeatable test case:

while true
do
psql -c "create database foo;" postgres || exit 1
psql -c "create table foo(f1 int primary key);" foo || exit 1
psql -c "drop table foo;" foo || exit 1
psql -c "checkpoint" postgres &
psql -c "drop database foo;" postgres || exit 1
done

On my machine this fairly consistently draws warnings in both 8.3 and
HEAD.  I believe what is happening is that the bgwriter has a
PendingUnlinkEntry for table foo, and completion of the checkpoint
prompts it to exercise that.  Meanwhile in the DROP DATABASE, rmtree is
working through a list of files to drop, and when it hits the
already-deleted one it complains --- and not only does it complain,
it stops trying to delete any more.  (The second WARNING is quite
misleading, because what it really means is "I stopped trying".)

Without the CHECKPOINT, what we get instead is that each cycle builds up
some more PendingUnlinkEntrys, which will all fail when the checkpoint
comes.  The bgwriter is coded to not report ENOENT, so you don't see any
evidence of that, but it's clearly a possible case and the comment
saying it shouldn't happen is misleading.

Actually ... what if the same DB OID and relfilenode get re-made before
the checkpoint?  Then we'd be unlinking live data.  This is improbable
but hardly less so than the scenario the PendingUnlinkEntry code was
put in to prevent.

ISTM that we must fix the bgwriter so that ForgetDatabaseFsyncRequests
causes PendingUnlinkEntrys for the doomed DB to be thrown away too.
This should prevent the unlink-live-data scenario, I think.
Even then, concurrent deletion attempts are probably possible (since
ForgetDatabaseFsyncRequests is asynchronous) and rmtree() is being far
too fragile about dealing with them.  I think that it should be coded
to ignore ENOENT the same as the bgwriter does, and that it should press
on and keep trying to delete things even if it gets a failure.

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers