Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Wed, Mar 15, 2017 at 1:13 AM, Robert Haaswrote: > On Mon, Mar 13, 2017 at 8:51 PM, Michael Paquier > wrote: >>> I agree with that, but I propose the attached version instead. It >>> seems cleaner to have the entire test for setting BM_PERMANENT in one >>> place rather than splitting it up as you did. >> >> Fine for me. You may want to update the comment of BM_PERMANENT in >> buf_internals.h as Artur has mentioned upthread. For example by just >> adding "and init forks". > > OK, done, and back-patched all the way. Thanks. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Mon, Mar 13, 2017 at 8:51 PM, Michael Paquierwrote: >> I agree with that, but I propose the attached version instead. It >> seems cleaner to have the entire test for setting BM_PERMANENT in one >> place rather than splitting it up as you did. > > Fine for me. You may want to update the comment of BM_PERMANENT in > buf_internals.h as Artur has mentioned upthread. For example by just > adding "and init forks". OK, done, and back-patched all the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Tue, Mar 14, 2017 at 4:46 AM, Robert Haaswrote: > On Wed, Jan 25, 2017 at 7:14 PM, Michael Paquier > wrote: >> (Adding Robert in CC.) >> >> On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao wrote: >>> An unlogged table has an initialization fork. The initialization fork does >>> not have an BM_PERMANENT flag when get a buffer. >>> In checkpoint (not shutdown or end of recovery), it will not write to disk. >>> after a crash recovery, the page of initialization fork will not correctly, >>> then make the main fork not correctly too. >> >> For init forks the flush need absolutely to happen, so that's really >> not good. We ought to fix BufferAlloc() appropriately here. > > I agree with that, but I propose the attached version instead. It > seems cleaner to have the entire test for setting BM_PERMANENT in one > place rather than splitting it up as you did. Fine for me. You may want to update the comment of BM_PERMANENT in buf_internals.h as Artur has mentioned upthread. For example by just adding "and init forks". > I believe this sets a record for the longest-lived data corruption bug > in a commit made by me. Really? I'll need to double-check the git history here. > Six years and change, woohoo. :-( And that much for someone to report it. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Wed, Jan 25, 2017 at 7:14 PM, Michael Paquierwrote: > (Adding Robert in CC.) > > On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao wrote: >> An unlogged table has an initialization fork. The initialization fork does >> not have an BM_PERMANENT flag when get a buffer. >> In checkpoint (not shutdown or end of recovery), it will not write to disk. >> after a crash recovery, the page of initialization fork will not correctly, >> then make the main fork not correctly too. > > For init forks the flush need absolutely to happen, so that's really > not good. We ought to fix BufferAlloc() appropriately here. I agree with that, but I propose the attached version instead. It seems cleaner to have the entire test for setting BM_PERMANENT in one place rather than splitting it up as you did. I believe this sets a record for the longest-lived data corruption bug in a commit made by me. Six years and change, woohoo. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company unlogged-flush-fix-rmh.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Sat, Mar 11, 2017 at 12:03 AM, Artur Zakirovwrote: > Because BM_PERMANENT is used for init forks of unlogged indexes now. Yes, indeed. -- Michael diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index 913f1f8a51..3557b106d8 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo) void blbuildempty(Relation index) { - Page metapage; + Buffer MetaBuffer; - /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); - BloomFillMetapage(index, metapage); + /* An empty bloom index has one meta page */ + MetaBuffer = + ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE); - /* - * Write the page and log it. It might seem that an immediate sync - * would be sufficient to guarantee that the file exists on disk, but - * recovery itself might remove it while replaying, for example, an - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we - * need this even when wal_level=minimal. - */ - PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); - smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, - (char *) metapage, true); - log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM, -BLOOM_METAPAGE_BLKNO, metapage, false); + /* Initialize the meta buffer */ + BloomFillMetapage(index, BufferGetPage(MetaBuffer)); - /* - * An immediate sync is required even if we xlog'd the page, because the - * write did not go through shared_buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(index->rd_smgr, INIT_FORKNUM); + /* And log it */ + START_CRIT_SECTION(); + MarkBufferDirty(MetaBuffer); + log_newpage_buffer(MetaBuffer, false); + END_CRIT_SECTION(); + + /* Unlock and release the buffer */ + UnlockReleaseBuffer(MetaBuffer); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 775f2ff1f8..e1801ea939 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -23,6 +23,7 @@ #include "access/xlog.h" #include "catalog/index.h" #include "commands/vacuum.h" +#include "miscadmin.h" #include "pgstat.h" #include "storage/condition_variable.h" #include "storage/indexfsm.h" @@ -282,31 +283,22 @@ btbuildCallback(Relation index, void btbuildempty(Relation index) { - Page metapage; - - /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); - _bt_initmetapage(metapage, P_NONE, 0); - - /* - * Write the page and log it. It might seem that an immediate sync - * would be sufficient to guarantee that the file exists on disk, but - * recovery itself might remove it while replaying, for example, an - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we - * need this even when wal_level=minimal. - */ - PageSetChecksumInplace(metapage, BTREE_METAPAGE); - smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE, - (char *) metapage, true); - log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM, -BTREE_METAPAGE, metapage, false); - - /* - * An immediate sync is required even if we xlog'd the page, because the - * write did not go through shared_buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(index->rd_smgr, INIT_FORKNUM); + Buffer metabuffer; + + /* An empty btree index has one meta page */ + metabuffer = + ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE); + + /* Initialize and xlog meta buffer */ + START_CRIT_SECTION(); + _bt_initmetapage(BufferGetPage(metabuffer), P_NONE, 0); + MarkBufferDirty(metabuffer); + log_newpage_buffer(metabuffer, false); + END_CRIT_SECTION(); + + /* Unlock and release the buffer */ + UnlockReleaseBuffer(metabuffer); } /* diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c index 00a0ab4438..4252c2eb53 100644 --- a/src/backend/access/spgist/spginsert.c +++ b/src/backend/access/spgist/spginsert.c @@ -156,49 +156,50 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo) void spgbuildempty(Relation index) { - Page page; - - /* Construct metapage. */ - page = (Page) palloc(BLCKSZ); - SpGistInitMetapage(page); + Buffer MetaBuffer, +RootBuffer, +TupleBuffer; /* - * Write the page and log it unconditionally. This is important - * particularly for indexes created on tablespaces and databases - * whose creation happened after the last redo pointer as recovery - * removes any of their existing content when the corresponding - * create records are replayed. + * An empty SPGist index has three pages: + * - one meta page. + * - one root page. + * - one null-tuple root page. */ - PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO); - smgrwrite(index->rd_smgr, INIT_FORKNUM,
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On 10.03.2017 04:00, Michael Paquier wrote: On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirovwrote: I think this is good fixes. I've checked them. And in my opinion they are correct. The code also is good. Having something with conflicts is not nice, so attached is a rebased version. Thank you! I've rerun regression and TAP tests. They all passed. Also maybe it will be good to fix comments. In buf_internals.h: #define BM_PERMANENT(1U << 31)/* permanent relation (not * unlogged) */ And in FlushBuffer(): /* * Force XLOG flush up to buffer's LSN. This implements the basic WAL * rule that log updates must hit disk before any of the data-file changes * they describe do. * * However, this rule does not apply to unlogged relations, which will be * lost after a crash anyway. Most unlogged relation pages do not bear Because BM_PERMANENT is used for init forks of unlogged indexes now. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirovwrote: > I think this is good fixes. I've checked them. And in my opinion they are > correct. > > The code also is good. Having something with conflicts is not nice, so attached is a rebased version. > I have run regression and TAP tests. They all passed without error. > > I think the patch can be marked as "Ready for Committer" after rebase. Thanks for the review. -- Michael unlogged-flush-fix-head-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
Hello, I wanted to review the patch. But the patch is applied with errors. I've rebased the local copy and have done review on it. I'm not sure is it properly to send rebased patch by reviewer, so I haven't sent it to avoid confuses. On 29.01.2017 17:00, Michael Paquier wrote: Attached is what I have in mind for HEAD. btree, gist, spgist and bloom indexes are changed so as the init forks created go through the shared buffers instead of having their empty() routines handle the flush of the page created. This removes any kind of race conditions between the checkpointer and the init fork creations, which is I think a good thing. I think this is good fixes. I've checked them. And in my opinion they are correct. The code also is good. Here are the tests I have done. First running those commands to create all types of indexes. create extension bloom; create extension btree_gist; create extension btree_gin; create unlogged table foo (a int); create index foo_bt on foo(a); create index foo_bloom on foo using bloom(a); create index foo_gin on foo using gin (a); create index foo_gist on foo using gist (a); create index foo_brin on foo using brin (a); create unlogged table foo_geo (a box); create index foo_spgist ON foo_geo using spgist(a); checkpoint; Then crash the server, restart it, and the following vacuums are able to complete. vacuum foo; vacuum foo_geo; I've done this tests. Before the patch server crashes on vacuum command. After applying the patch server doesn't crash on vacuum command. I have run regression and TAP tests. They all passed without error. I think the patch can be marked as "Ready for Committer" after rebase. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
On Thu, Jan 26, 2017 at 9:14 AM, Michael Paquierwrote: > So the patch attached fixes the problem by changing BufferAlloc() in > such a way that initialization forks are permanently written to disk, > which is what you are suggesting. As a simple fix for back-branches > that's enough, though on HEAD I think that we should really rework the > empty() routines so as the write goes through shared buffers first, > that seems more solid than relying on the sgmr routines to do this > work. Robert, what do you think? Attached is what I have in mind for HEAD. btree, gist, spgist and bloom indexes are changed so as the init forks created go through the shared buffers instead of having their empty() routines handle the flush of the page created. This removes any kind of race conditions between the checkpointer and the init fork creations, which is I think a good thing. Here are the tests I have done. First running those commands to create all types of indexes. create extension bloom; create extension btree_gist; create extension btree_gin; create unlogged table foo (a int); create index foo_bt on foo(a); create index foo_bloom on foo using bloom(a); create index foo_gin on foo using gin (a); create index foo_gist on foo using gist (a); create index foo_brin on foo using brin (a); create unlogged table foo_geo (a box); create index foo_spgist ON foo_geo using spgist(a); checkpoint; Then crash the server, restart it, and the following vacuums are able to complete. vacuum foo; vacuum foo_geo; I have as well created a CF entry for this set of patches: https://commitfest.postgresql.org/13/971/ Thanks, -- Michael diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index 29bc5cea79..d04298add5 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo) void blbuildempty(Relation index) { - Page metapage; + Buffer MetaBuffer; - /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); - BloomFillMetapage(index, metapage); + /* An empty bloom index has one meta page */ + MetaBuffer = + ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE); - /* - * Write the page and log it. It might seem that an immediate sync - * would be sufficient to guarantee that the file exists on disk, but - * recovery itself might remove it while replaying, for example, an - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we - * need this even when wal_level=minimal. - */ - PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); - smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, - (char *) metapage, true); - log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM, -BLOOM_METAPAGE_BLKNO, metapage, false); + /* Initialize the meta buffer */ + BloomFillMetapage(index, BufferGetPage(MetaBuffer)); - /* - * An immediate sync is required even if we xlog'd the page, because the - * write did not go through shared_buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(index->rd_smgr, INIT_FORKNUM); + /* And log it */ + START_CRIT_SECTION(); + MarkBufferDirty(MetaBuffer); + log_newpage_buffer(MetaBuffer, false); + END_CRIT_SECTION(); + + /* Unlock and release the buffer */ + UnlockReleaseBuffer(MetaBuffer); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 1bb1acfea6..215d3c6c02 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -23,6 +23,7 @@ #include "access/xlog.h" #include "catalog/index.h" #include "commands/vacuum.h" +#include "miscadmin.h" #include "storage/indexfsm.h" #include "storage/ipc.h" #include "storage/lmgr.h" @@ -237,31 +238,22 @@ btbuildCallback(Relation index, void btbuildempty(Relation index) { - Page metapage; - - /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); - _bt_initmetapage(metapage, P_NONE, 0); - - /* - * Write the page and log it. It might seem that an immediate sync - * would be sufficient to guarantee that the file exists on disk, but - * recovery itself might remove it while replaying, for example, an - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we - * need this even when wal_level=minimal. - */ - PageSetChecksumInplace(metapage, BTREE_METAPAGE); - smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE, - (char *) metapage, true); - log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM, -BTREE_METAPAGE, metapage, false); - - /* - * An immediate sync is required even if we xlog'd the page, because the - * write did not go through shared_buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(index->rd_smgr, INIT_FORKNUM); + Buffer metabuffer; + + /* An empty btree index has one meta page */
Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
(Adding Robert in CC.) On Thu, Jan 26, 2017 at 4:34 AM, Wang Haowrote: > An unlogged table has an initialization fork. The initialization fork does > not have an BM_PERMANENT flag when get a buffer. > In checkpoint (not shutdown or end of recovery), it will not write to disk. > after a crash recovery, the page of initialization fork will not correctly, > then make the main fork not correctly too. For init forks the flush need absolutely to happen, so that's really not good. We ought to fix BufferAlloc() appropriately here. > Here is an example for GIN index. > > create unlogged table gin_test_tbl(i int4[]); > create index gin_test_idx on gin_test_tbl using gin (i); > checkpoint; > > kill all the postgres process, and restart again. > > vacuum gin_test_tbl; -- crash. > > It seems have same problem in BRIN, GIN, GiST and HASH index which using > buffer for meta page initialize in ambuildempty function. Yeah, other index AMs deal directly with the sync of the page, that's why there is no issue for them. So the patch attached fixes the problem by changing BufferAlloc() in such a way that initialization forks are permanently written to disk, which is what you are suggesting. As a simple fix for back-branches that's enough, though on HEAD I think that we should really rework the empty() routines so as the write goes through shared buffers first, that seems more solid than relying on the sgmr routines to do this work. Robert, what do you think? -- Michael unlogged-flush-fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag
An unlogged table has an initialization fork. The initialization fork does not have an BM_PERMANENT flag when get a buffer. In checkpoint (not shutdown or end of recovery), it will not write to disk. after a crash recovery, the page of initialization fork will not correctly, then make the main fork not correctly too. Here is an example for GIN index. create unlogged table gin_test_tbl(i int4[]); create index gin_test_idx on gin_test_tbl using gin (i); checkpoint; kill all the postgres process, and restart again. vacuum gin_test_tbl; -- crash. It seems have same problem in BRIN, GIN, GiST and HASH index which using buffer for meta page initialize in ambuildempty function.