Re: [BUGS] page corruption after moving tablespace

2010-07-26 Thread Jeff Davis
On Thu, 2010-07-22 at 23:50 -0700, Jeff Davis wrote:
> I was investigating some strange page corruption today in which the page
> was completely zeroed except for the LSN and TLI.
 

I see that this was added to the 9.0 open items list, but it affects
versions 8.3 and later.

I should have indicated that in my initial report.

Regards,
Jeff Davis


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs


[BUGS] page corruption after moving tablespace

2010-07-23 Thread Jeff Davis
I was investigating some strange page corruption today in which the page
was completely zeroed except for the LSN and TLI.

I found a sequence that can cause that problem even in 9.0:

(wal_level must be set to archive or greater)

1. Create a tablespace t1
2. Create a table foo
3. Attach to the backend with gdb, and set a breakpoint at the
START_CRITICAL_SECTION() line in heap_insert(). Continue in gdb.
4. Insert a tuple into foo.
5. gdb should break. At that point, send SIGKILL to the backend.
6. Restart the server (if it doesn't restart itself).
7. ALTER TABLE foo SET TABLESPACE t1;
8. SELECT * FROM foo;
ERROR:  invalid page header in block 0 of relation
pg_tblspc/16384/PG_9.1_201007151/11876/24576

The SIGKILL is just a way to get an all-zero page to end up in a heap
file. Any time a relation contains an all-zero page (which is generally
treated as a valid situation in postgres), changing the tablespace is a
problem. The code calls copy_relation_data(), which calls log_newpage(),
which sets the LSN and TLI on the page before the page is written. On an
all-zero page, the result is no longer all-zero but still has no valid
header, so the page is corrupt.
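
To make the failure mode concrete, here is a minimal standalone sketch. It
is a toy model, not PostgreSQL source: ToyPageHeader, ToyPage, and every
toy_* function are hypothetical names, and the header layout is simplified.
It only assumes what is described above: an all-zero page is accepted, and
stamping the LSN and TLI onto such a page leaves a header that is neither
all-zero nor sane.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192

/* Hypothetical, simplified header; NOT PostgreSQL's real page header. */
typedef struct ToyPageHeader {
    uint64_t lsn;    /* WAL position of the last change */
    uint32_t tli;    /* timeline ID */
    uint16_t lower;  /* offset to start of free space */
    uint16_t upper;  /* offset to end of free space; 0 on a new page */
} ToyPageHeader;

typedef union ToyPage {
    ToyPageHeader hdr;
    unsigned char bytes[TOY_BLCKSZ];
} ToyPage;

/* True if every byte of the page is zero. */
bool toy_page_is_all_zero(const ToyPage *page)
{
    for (size_t i = 0; i < TOY_BLCKSZ; i++)
        if (page->bytes[i] != 0)
            return false;
    return true;
}

/* Accept an all-zero page, or one whose free-space offsets are sane. */
bool toy_page_header_is_valid(const ToyPage *page)
{
    if (toy_page_is_all_zero(page))
        return true;
    return page->hdr.lower >= sizeof(ToyPageHeader)
        && page->hdr.lower <= page->hdr.upper
        && page->hdr.upper <= TOY_BLCKSZ;
}

/* Models the problematic step: stamp LSN and TLI unconditionally. */
void toy_stamp_lsn_tli(ToyPage *page, uint64_t lsn, uint32_t tli)
{
    assert(page != NULL);
    page->hdr.lsn = lsn;
    page->hdr.tli = tli;
}
```

In this model, a zeroed ToyPage passes toy_page_header_is_valid(); after
toy_stamp_lsn_tli() with any nonzero LSN or TLI it fails, because lower and
upper are still 0 while the page is no longer all-zero.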

I think the simple fix would be to have copy_relation_data call
PageInit() if it's a new page. Are there other areas where a similar
problem might exist?
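
A rough sketch of that idea, again as a standalone toy model rather than a
patch: FixPageHeader, FixPage, and the fix_* functions are hypothetical
stand-ins, and fix_page_init() only imitates what PageInit() would do for a
page with no special space. The one real detail it leans on is that a new
page is recognized by its upper offset being 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIX_BLCKSZ 8192

/* Hypothetical, simplified header; NOT PostgreSQL's real page header. */
typedef struct FixPageHeader {
    uint64_t lsn;
    uint32_t tli;
    uint16_t lower;  /* offset to start of free space */
    uint16_t upper;  /* offset to end of free space; 0 on a new page */
} FixPageHeader;

typedef union FixPage {
    FixPageHeader hdr;
    unsigned char bytes[FIX_BLCKSZ];
} FixPage;

/* A page that was never initialized has upper == 0. */
bool fix_page_is_new(const FixPage *page)
{
    return page->hdr.upper == 0;
}

/* Stand-in for page initialization: empty page, all space free. */
void fix_page_init(FixPage *page)
{
    memset(page, 0, sizeof *page);
    page->hdr.lower = (uint16_t) sizeof(FixPageHeader);
    page->hdr.upper = FIX_BLCKSZ;
}

/* Sanity check on the free-space offsets only. */
bool fix_header_is_sane(const FixPage *page)
{
    return page->hdr.lower >= sizeof(FixPageHeader)
        && page->hdr.lower <= page->hdr.upper
        && page->hdr.upper <= FIX_BLCKSZ;
}

/* Copy one page; initialize it first if the source was a new page. */
void fix_copy_page(FixPage *dst, const FixPage *src,
                   uint64_t lsn, uint32_t tli)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
    if (fix_page_is_new(dst))
        fix_page_init(dst);
    dst->hdr.lsn = lsn;
    dst->hdr.tli = tli;
}
```

With this guard, copying an all-zero source page produces an empty page with
a sane header plus the LSN and TLI, instead of a stamped but otherwise
zeroed page. Whether initializing the page is preferable to leaving it
all-zero and skipping the stamp is a separate question.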

Regards,
Jeff Davis




Re: [BUGS] page corruption after moving tablespace

2010-07-23 Thread Jeff Davis
On Thu, 2010-07-22 at 23:50 -0700, Jeff Davis wrote:
> I think the simple fix would be to have copy_relation_data call
> PageInit() if it's a new page.

On second thought, why are PageSetLSN and PageSetTLI being called from
log_newpage(), anyway? It says that all of the callers use smgr
directly, rather than the buffer cache.

Regards,
Jeff Davis

