Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
Sorry, I sent the previous mail by mistake just now.

> You need to set $PGDATA before running the script. And psql, pg_ctl and
> pg_resetxlog need to be in $PATH. After running the script, restart
> postmaster and run "SELECT * FROM t2". There should be one row in the
> table, but it's empty.

I've tried this script on "postgres (PostgreSQL) 8.3devel" and found that T2 is not empty after recovery (just as it should be), but the latest version behaves exactly as you described.

Then I looked at how CLUSTER is implemented and found the following: in "postgres (PostgreSQL) 8.3devel", unlike AlterTableSetTablespace (which copies the whole relation block by block and doesn't use WAL in non-archiving mode), CLUSTER copies the relation row by row with simple_heap_insert, which always writes WAL regardless of archiving mode. Since the WAL records exist, recovery handles everything correctly. The latest version probably behaves differently because it skips WAL for CLUSTER in non-archiving mode.

So my conclusion is: we can replace WAL logging with smgrimmedsync only if relfilenodes are never reused, but that isn't true, so we should keep the WAL logging. Is that right?
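For reference, here is a rough sketch of the block-by-block copy pattern being contrasted with CLUSTER above, in the spirit of copy_relation_data (tablecmds.c) as of this era: pages are WAL-logged only when archiving is on, and otherwise the new file is forced to disk with smgrimmedsync. The function name copy_relation_data_sketch is made up, the smgr/xlog calls are written from memory of the sources of this period, and the snippet assumes the usual backend headers; treat it as an approximation, not the actual code.

```c
/*
 * Rough sketch of the block-by-block copy pattern under discussion
 * (modeled loosely on copy_relation_data; simplified, assumes the usual
 * backend headers; the function name is invented).
 */
static void
copy_relation_data_sketch(SMgrRelation src, SMgrRelation dst, bool istemp)
{
    char        buf[BLCKSZ];
    Page        page = (Page) buf;
    BlockNumber nblocks = smgrnblocks(src);
    BlockNumber blkno;

    /* WAL-log the copied pages only if archiving is on (and not a temp rel) */
    bool        use_wal = XLogArchivingActive() && !istemp;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        smgrread(src, blkno, buf);

        if (use_wal)
            log_newpage(&dst->smgr_rnode, blkno, page);

        smgrextend(dst, blkno, buf, true);
    }

    /*
     * Without WAL, force the new file to disk before commit.  As this thread
     * shows, that is only safe if replay of an older DROP can never remove
     * the file, i.e. if the relfilenode is not reused.
     */
    if (!use_wal)
        smgrimmedsync(dst);
}
```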
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
> You need to set $PGDATA before running the script. And psql, pg_ctl and
> pg_resetxlog need to be in $PATH. After running the script, restart
> postmaster and run "SELECT * FROM t2". There should be one row in the
> table, but it's empty.

I've tried this script, and surprisingly found that T2 is not empty, just as it should be. Then I looked at how CLUSTER is done, and found that
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
Florian G. Pflug wrote:
> Heikki Linnakangas wrote:
>> I wrote:
>>> Unfortunately I don't see any easy way to fix it. One approach would be
>>> to avoid reusing the relfilenodes until next checkpoint, but I don't see
>>> any nice place to keep track of OIDs that have been dropped since last
>>> checkpoint.
>>
>> Ok, here's one idea:
>>
>> Instead of deleting the file immediately on commit of DROP TABLE, the
>> file is truncated to release the space, but not unlink()ed, to avoid
>> reusing that relfilenode. The truncated file can be deleted after next
>> checkpoint.
>>
>> Now, how does checkpoint know what to delete? We can use the fsync
>> request mechanism for that. When a file is truncated, a new kind of
>> fsync request, a "deletion request", is sent to the bgwriter, which
>> collects all such requests to a list. Before checkpoint calculates new
>> RedoRecPtr, the list is swapped with an empty one, and after writing the
>> new checkpoint record, all the files that were in the list are deleted.
>>
>> We would leak empty files on crashes, but we leak files on crashes
>> anyway, so that shouldn't be an issue. This scheme wouldn't require
>> catalog changes, so it would be suitable for backpatching.
>>
>> Any better ideas?
>
> Couldn't we fix this by forcing a checkpoint before we commit the
> transaction that created the new pg_class entry for the clustered table?
> Or rather, more generally, before committing a transaction that created
> a new non-temporary relfilenode but didn't WAL-log any subsequent inserts.

Yes, that would work. As a small optimization, you could set a flag in
shared memory whenever you delete a relation file, and skip the checkpoint
when that flag isn't set.

> That's of course a rather sledgehammer-like approach to this problem,
> but at least for the back branches the fix would be less intrusive...

Too much of a sledgehammer IMHO.

BTW, CREATE INDEX is also vulnerable. And in 8.3, so is COPY to a table
created or truncated in the same transaction.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
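To make the idea above concrete, here is a purely hypothetical sketch of the checkpoint-before-commit approach combined with the shared-memory flag. Every name in it (RelFileDeletedSinceCheckpoint, MyXactCreatedUnloggedRelFile, ForceCheckpointAndWait, AtCommit_MaybeForceCheckpoint) is invented for illustration; only the decision logic follows the discussion in this message.

```c
/*
 * Hypothetical illustration of the checkpoint-before-commit idea with the
 * shared-memory flag optimization.  All names here are invented; this is
 * not actual PostgreSQL code, only the decision logic from the discussion.
 */

/* In shared memory: set whenever a relation file is deleted, cleared when a
 * checkpoint completes. */
extern volatile bool *RelFileDeletedSinceCheckpoint;

/* Per-backend: this transaction created a relfilenode and filled it without
 * WAL-logging the inserts (e.g. CLUSTER with archiving off). */
extern bool MyXactCreatedUnloggedRelFile;

/* Stand-in for requesting an immediate checkpoint and waiting for it. */
extern void ForceCheckpointAndWait(void);

static void
AtCommit_MaybeForceCheckpoint(void)
{
    /*
     * If no file has been deleted since the last checkpoint, crash recovery
     * cannot replay an old DROP over our newly created file, so the
     * checkpoint can be skipped.  Otherwise force one, so that the REDO
     * point lies after the creation of our relfilenode and its un-logged
     * (but fsync'ed) contents can no longer be wiped out by replay.
     */
    if (MyXactCreatedUnloggedRelFile && *RelFileDeletedSinceCheckpoint)
        ForceCheckpointAndWait();

    MyXactCreatedUnloggedRelFile = false;
}
```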
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
Simon Riggs wrote:
> On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>>>> Second, suppose that no checkpoint has occurred during the above
>>>> series -- although that's not very likely;
>>>
>>> That part is irrelevant. It's forced out to disk and doesn't need
>>> recovery, with or without the checkpoint.
>>>
>>> There's no hole that I can see.
>>
>> No, Jacky is right. The same problem exists at least with CLUSTER, and I
>> think there are other commands that rely on immediate fsync as well.
>>
>> Attached is a shell script that demonstrates the problem on CVS HEAD
>> with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
>> T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
>> happens to get the same relfilenode that T1 had. Then we crash the
>> server, forcing a WAL replay. After that, T2 is empty. Oops.
>>
>> Unfortunately I don't see any easy way to fix it.
>
> So, what you are saying is that re-using relfilenodes can cause problems
> during recovery in any command that alters the relfilenode of a relation?

From what I understand, I'd say that creating a relfilenode *and*
subsequently inserting data without WAL-logging causes the problem. If the
relfilenode was recently deleted, the inserts might be effectively undone
upon recovery (because we first replay the delete), but later *not* redone
(because we didn't WAL-log the inserts).

That brings me to another idea for a fix that is less heavyweight than my
previous checkpoint-before-commit suggestion. We could make relfilenodes
globally unique if we added the xid and epoch of the creating transaction
to the filename. Those are 64 bits, so if we encode them in base 36 (using
A-Z and 0-9), that'd increase the length of the filenames by 13 characters.

regards,
Florian Pflug
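As a quick sanity check on that arithmetic: 36^12 is about 4.7e18, which is less than 2^64 (about 1.8e19), while 36^13 is about 1.7e20, so 13 base-36 digits are indeed enough for any 64-bit value. A small standalone C illustration follows; the function name and the epoch/xid packing are made up for the example and do not correspond to anything in the sources.

```c
/* Standalone illustration of encoding a 64-bit value in base 36 (0-9, A-Z).
 * Since 36^12 < 2^64 <= 36^13, a fixed width of 13 digits always suffices.
 * The function name is hypothetical; nothing like it exists in PostgreSQL. */
#include <stdint.h>
#include <stdio.h>

static void
encode_base36(uint64_t value, char out[14])
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int         i;

    out[13] = '\0';
    for (i = 12; i >= 0; i--)
    {
        out[i] = digits[value % 36];    /* least significant digit last */
        value /= 36;
    }
}

int
main(void)
{
    char        suffix[14];

    /* e.g. epoch in the high 32 bits, xid in the low 32 bits */
    uint64_t    xid_with_epoch = ((uint64_t) 1 << 32) | 12345;

    encode_base36(xid_with_epoch, suffix);
    printf("filename suffix: %s\n", suffix);    /* always 13 characters */
    return 0;
}
```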
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
> >> Second, suppose that no checkpoint has occurred during the above
> >> series -- although that's not very likely;
> >
> > That part is irrelevant. It's forced out to disk and doesn't need
> > recovery, with or without the checkpoint.
> >
> > There's no hole that I can see.
>
> No, Jacky is right. The same problem exists at least with CLUSTER, and I
> think there are other commands that rely on immediate fsync as well.
>
> Attached is a shell script that demonstrates the problem on CVS HEAD
> with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
> T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
> happens to get the same relfilenode that T1 had. Then we crash the
> server, forcing a WAL replay. After that, T2 is empty. Oops.
>
> Unfortunately I don't see any easy way to fix it.

So, what you are saying is that re-using relfilenodes can cause problems
during recovery in any command that alters the relfilenode of a relation?

If you've got a better problem statement, it would be good to get that
right first before we discuss solutions.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
Heikki Linnakangas wrote:
> I wrote:
>> Unfortunately I don't see any easy way to fix it. One approach would be
>> to avoid reusing the relfilenodes until next checkpoint, but I don't see
>> any nice place to keep track of OIDs that have been dropped since last
>> checkpoint.
>
> Ok, here's one idea:
>
> Instead of deleting the file immediately on commit of DROP TABLE, the
> file is truncated to release the space, but not unlink()ed, to avoid
> reusing that relfilenode. The truncated file can be deleted after next
> checkpoint.
>
> Now, how does checkpoint know what to delete? We can use the fsync
> request mechanism for that. When a file is truncated, a new kind of
> fsync request, a "deletion request", is sent to the bgwriter, which
> collects all such requests to a list. Before checkpoint calculates new
> RedoRecPtr, the list is swapped with an empty one, and after writing the
> new checkpoint record, all the files that were in the list are deleted.
>
> We would leak empty files on crashes, but we leak files on crashes
> anyway, so that shouldn't be an issue. This scheme wouldn't require
> catalog changes, so it would be suitable for backpatching.
>
> Any better ideas?

Couldn't we fix this by forcing a checkpoint before we commit the
transaction that created the new pg_class entry for the clustered table?
Or rather, more generally, before committing a transaction that created a
new non-temporary relfilenode but didn't WAL-log any subsequent inserts.

That's of course a rather sledgehammer-like approach to this problem, but
at least for the back branches the fix would be less intrusive...

regards,
Florian Pflug
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
I wrote:
> Unfortunately I don't see any easy way to fix it. One approach would be
> to avoid reusing the relfilenodes until next checkpoint, but I don't see
> any nice place to keep track of OIDs that have been dropped since last
> checkpoint.

Ok, here's one idea:

Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.

Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a "deletion request", is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.

We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.

Any better ideas?

Do we care enough about this to fix this? Enough to backpatch? The
probability of this happening is pretty small, but the consequences are
really bad, so my vote is "yes" and "yes".

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
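For readers less familiar with the fsync-request machinery, here is a hypothetical sketch of how the "deletion request" side of the proposal could look. The type and function names (PendingUnlink, pendingUnlinks, RememberUnlinkRequest, PostCheckpointUnlinks) are invented for illustration, the snippet assumes the usual backend headers, and only the overall pattern (register at truncate time, swap the list out before the new RedoRecPtr is computed, unlink only after the checkpoint record is written) follows the text above.

```c
/*
 * Hypothetical sketch of the "deletion request" idea.  All names here are
 * invented for illustration; this is not actual PostgreSQL code.  Assumes
 * the usual backend headers (postgres.h, storage/relfilenode.h, etc.).
 */

typedef struct PendingUnlink
{
    RelFileNode rnode;              /* file to unlink after the next checkpoint */
    struct PendingUnlink *next;
} PendingUnlink;

static PendingUnlink *pendingUnlinks = NULL;    /* collected in the bgwriter */

/* Called in the bgwriter when a "deletion request" arrives, i.e. after a
 * committed DROP TABLE has truncated the file instead of unlinking it. */
static void
RememberUnlinkRequest(RelFileNode rnode)
{
    PendingUnlink *entry = palloc(sizeof(PendingUnlink));

    entry->rnode = rnode;
    entry->next = pendingUnlinks;
    pendingUnlinks = entry;
}

/* Called by the checkpoint code: the caller swaps out the list *before*
 * computing the new RedoRecPtr, and passes it here only after the checkpoint
 * record has been written.  Until then the truncated file keeps its
 * relfilenode reserved, so CLUSTER etc. cannot reuse it. */
static void
PostCheckpointUnlinks(PendingUnlink *list)
{
    while (list != NULL)
    {
        PendingUnlink *next = list->next;
        char          *path = relpath(list->rnode);

        unlink(path);
        pfree(path);
        pfree(list);
        list = next;
    }
}
```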
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
Forgot to attach the script I promised..

You need to set $PGDATA before running the script. And psql, pg_ctl and
pg_resetxlog need to be in $PATH. After running the script, restart
postmaster and run "SELECT * FROM t2". There should be one row in the
table, but it's empty.

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>>> Second, suppose that no checkpoint has occurred during the above
>>> series -- although that's not very likely;
>>
>> That part is irrelevant. It's forced out to disk and doesn't need
>> recovery, with or without the checkpoint.
>>
>> There's no hole that I can see.
>
> No, Jacky is right. The same problem exists at least with CLUSTER, and I
> think there are other commands that rely on immediate fsync as well.
>
> Attached is a shell script that demonstrates the problem on CVS HEAD
> with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
> T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
> happens to get the same relfilenode that T1 had. Then we crash the
> server, forcing a WAL replay. After that, T2 is empty. Oops.
>
> Unfortunately I don't see any easy way to fix it. One approach would be
> to avoid reusing the relfilenodes until next checkpoint, but I don't see
> any nice place to keep track of OIDs that have been dropped since last
> checkpoint.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

[Attachment: cluster-relfilenode-clash.sh.gz (gzip-compressed shell script)]
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>> Second, suppose that no checkpoint has occurred during the above
>> series -- although that's not very likely;
>
> That part is irrelevant. It's forced out to disk and doesn't need
> recovery, with or without the checkpoint.
>
> There's no hole that I can see.

Yes, it's really forced out. But if there's no checkpoint, recovery will
begin from a point before T1 was created, and since T1 was dropped, replay
will remove T2's file!

> --
> Simon Riggs
> 2ndQuadrant  http://www.2ndQuadrant.com
Re: [HACKERS] Why copy_relation_data only uses WAL when WAL archiving is enabled
Simon Riggs wrote:
> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>> Second, suppose that no checkpoint has occurred during the above
>> series -- although that's not very likely;
>
> That part is irrelevant. It's forced out to disk and doesn't need
> recovery, with or without the checkpoint.
>
> There's no hole that I can see.

No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there are other commands that rely on immediate fsync as well.

Attached is a shell script that demonstrates the problem on CVS HEAD
with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
happens to get the same relfilenode that T1 had. Then we crash the
server, forcing a WAL replay. After that, T2 is empty. Oops.

Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com