Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-14 Thread Michael Paquier
On Wed, Mar 15, 2017 at 1:13 AM, Robert Haas  wrote:
> On Mon, Mar 13, 2017 at 8:51 PM, Michael Paquier
>  wrote:
>>> I agree with that, but I propose the attached version instead.  It
>>> seems cleaner to have the entire test for setting BM_PERMANENT in one
>>> place rather than splitting it up as you did.
>>
>> Fine for me. You may want to update the comment of BM_PERMANENT in
>> buf_internals.h as Artur has mentioned upthread. For example by just
>> adding "and init forks".
>
> OK, done, and back-patched all the way.

Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-14 Thread Robert Haas
On Mon, Mar 13, 2017 at 8:51 PM, Michael Paquier
 wrote:
>> I agree with that, but I propose the attached version instead.  It
>> seems cleaner to have the entire test for setting BM_PERMANENT in one
>> place rather than splitting it up as you did.
>
> Fine for me. You may want to update the comment of BM_PERMANENT in
> buf_internals.h as Artur has mentioned upthread. For example by just
> adding "and init forks".

OK, done, and back-patched all the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-13 Thread Michael Paquier
On Tue, Mar 14, 2017 at 4:46 AM, Robert Haas  wrote:
> On Wed, Jan 25, 2017 at 7:14 PM, Michael Paquier
>  wrote:
>> (Adding Robert in CC.)
>>
>> On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao  wrote:
>>> An unlogged table has an initialization fork. The initialization fork does
>>> not have an BM_PERMANENT flag when get a buffer.
>>> In checkpoint (not shutdown or end of recovery), it will not write to disk.
>>> after a crash recovery, the page of initialization fork will not correctly,
>>> then make the main fork not correctly too.
>>
>> For init forks the flush need absolutely to happen, so that's really
>> not good. We ought to fix BufferAlloc() appropriately here.
>
> I agree with that, but I propose the attached version instead.  It
> seems cleaner to have the entire test for setting BM_PERMANENT in one
> place rather than splitting it up as you did.

Fine for me. You may want to update the comment of BM_PERMANENT in
buf_internals.h as Artur has mentioned upthread. For example by just
adding "and init forks".

> I believe this sets a record for the longest-lived data corruption bug
> in a commit made by me.

Really? I'll need to double-check the git history here.

> Six years and change, woohoo.  :-(

And that much for someone to report it.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-13 Thread Robert Haas
On Wed, Jan 25, 2017 at 7:14 PM, Michael Paquier
 wrote:
> (Adding Robert in CC.)
>
> On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao  wrote:
>> An unlogged table has an initialization fork. The initialization fork does
>> not have an BM_PERMANENT flag when get a buffer.
>> In checkpoint (not shutdown or end of recovery), it will not write to disk.
>> after a crash recovery, the page of initialization fork will not correctly,
>> then make the main fork not correctly too.
>
> For init forks the flush need absolutely to happen, so that's really
> not good. We ought to fix BufferAlloc() appropriately here.

I agree with that, but I propose the attached version instead.  It
seems cleaner to have the entire test for setting BM_PERMANENT in one
place rather than splitting it up as you did.

I believe this sets a record for the longest-lived data corruption bug
in a commit made by me.  Six years and change, woohoo.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


unlogged-flush-fix-rmh.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-10 Thread Michael Paquier
On Sat, Mar 11, 2017 at 12:03 AM, Artur Zakirov
 wrote:
> Because BM_PERMANENT is used for init forks of unlogged indexes now.

Yes, indeed.
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 913f1f8a51..3557b106d8 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 blbuildempty(Relation index)
 {
-	Page		metapage;
+	Buffer		MetaBuffer;
 
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	BloomFillMetapage(index, metapage);
+	/* An empty bloom index has one meta page */
+	MetaBuffer =
+		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * Write the page and log it.  It might seem that an immediate sync
-	 * would be sufficient to guarantee that the file exists on disk, but
-	 * recovery itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
-	 * need this even when wal_level=minimal.
-	 */
-	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
-			  (char *) metapage, true);
-	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-BLOOM_METAPAGE_BLKNO, metapage, false);
+	/* Initialize the meta buffer */
+	BloomFillMetapage(index, BufferGetPage(MetaBuffer));
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	/* And log it */
+	START_CRIT_SECTION();
+	MarkBufferDirty(MetaBuffer);
+	log_newpage_buffer(MetaBuffer, false);
+	END_CRIT_SECTION();
+
+	/* Unlock and release the buffer */
+	UnlockReleaseBuffer(MetaBuffer);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 775f2ff1f8..e1801ea939 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
@@ -282,31 +283,22 @@ btbuildCallback(Relation index,
 void
 btbuildempty(Relation index)
 {
-	Page		metapage;
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
-
-	/*
-	 * Write the page and log it.  It might seem that an immediate sync
-	 * would be sufficient to guarantee that the file exists on disk, but
-	 * recovery itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
-	 * need this even when wal_level=minimal.
-	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
-	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-BTREE_METAPAGE, metapage, false);
-
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	Buffer		metabuffer;
+
+	/* An empty btree index has one meta page */
+	metabuffer =
+		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/* Initialize and xlog meta buffer */
+	START_CRIT_SECTION();
+	_bt_initmetapage(BufferGetPage(metabuffer), P_NONE, 0);
+	MarkBufferDirty(metabuffer);
+	log_newpage_buffer(metabuffer, false);
+	END_CRIT_SECTION();
+
+	/* Unlock and release the buffer */
+	UnlockReleaseBuffer(metabuffer);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 00a0ab4438..4252c2eb53 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -156,49 +156,50 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 spgbuildempty(Relation index)
 {
-	Page		page;
-
-	/* Construct metapage. */
-	page = (Page) palloc(BLCKSZ);
-	SpGistInitMetapage(page);
+	Buffer		MetaBuffer,
+RootBuffer,
+TupleBuffer;
 
 	/*
-	 * Write the page and log it unconditionally.  This is important
-	 * particularly for indexes created on tablespaces and databases
-	 * whose creation happened after the last redo pointer as recovery
-	 * removes any of their existing content when the corresponding
-	 * create records are replayed.
+	 * An empty SPGist index has three pages:
+	 * - one meta page.
+	 * - one root page.
+	 * - one null-tuple root page.
 	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, 

Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-10 Thread Artur Zakirov

On 10.03.2017 04:00, Michael Paquier wrote:

On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirov  wrote:

I think this is good fixes. I've checked them. And in my opinion they are
correct.

The code also is good.


Having something with conflicts is not nice, so attached is a rebased version.


Thank you!

I've rerun regression and TAP tests. They all passed.

Also maybe it will be good to fix comments.

In buf_internals.h:

#define BM_PERMANENT(1U << 31)/* permanent relation (not
   * unlogged) */


And in FlushBuffer():

/*
 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 * rule that log updates must hit disk before any of the data-file changes
 * they describe do.
 *
 * However, this rule does not apply to unlogged relations, which will be
 * lost after a crash anyway.  Most unlogged relation pages do not bear


Because BM_PERMANENT is used for init forks of unlogged indexes now.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-09 Thread Michael Paquier
On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirov  wrote:
> I think this is good fixes. I've checked them. And in my opinion they are
> correct.
>
> The code also is good.

Having something with conflicts is not nice, so attached is a rebased version.

> I have run regression and TAP tests. They all passed without error.
>
> I think the patch can be marked as "Ready for Committer" after rebase.

Thanks for the review.
-- 
Michael


unlogged-flush-fix-head-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-09 Thread Artur Zakirov

Hello,

I wanted to review the patch. But the patch is applied with errors. I've 
rebased the local copy and have done review on it. I'm not sure is it 
properly to send rebased patch by reviewer, so I haven't sent it to 
avoid confuses.


On 29.01.2017 17:00, Michael Paquier wrote:

Attached is what I have in mind for HEAD. btree, gist, spgist and
bloom indexes are changed so as the init forks created go through the
shared buffers instead of having their empty() routines handle the
flush of the page created. This removes any kind of race conditions
between the checkpointer and the init fork creations, which is I think
a good thing.


I think this is good fixes. I've checked them. And in my opinion they 
are correct.


The code also is good.



Here are the tests I have done.
First running those commands to create all types of indexes.
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create unlogged table foo (a int);
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
create index foo_spgist ON foo_geo using spgist(a);
checkpoint;

Then crash the server, restart it, and the following vacuums are able
to complete.
vacuum foo;
vacuum foo_geo;



I've done this tests. Before the patch server crashes on vacuum command. 
After applying the patch server doesn't crash on vacuum command.


I have run regression and TAP tests. They all passed without error.

I think the patch can be marked as "Ready for Committer" after rebase.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-01-29 Thread Michael Paquier
On Thu, Jan 26, 2017 at 9:14 AM, Michael Paquier
 wrote:
> So the patch attached fixes the problem by changing BufferAlloc() in
> such a way that initialization forks are permanently written to disk,
> which is what you are suggesting. As a simple fix for back-branches
> that's enough, though on HEAD I think that we should really rework the
> empty() routines so as the write goes through shared buffers first,
> that seems more solid than relying on the sgmr routines to do this
> work. Robert, what do you think?

Attached is what I have in mind for HEAD. btree, gist, spgist and
bloom indexes are changed so as the init forks created go through the
shared buffers instead of having their empty() routines handle the
flush of the page created. This removes any kind of race conditions
between the checkpointer and the init fork creations, which is I think
a good thing.

Here are the tests I have done.
First running those commands to create all types of indexes.
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create unlogged table foo (a int);
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
create index foo_spgist ON foo_geo using spgist(a);
checkpoint;

Then crash the server, restart it, and the following vacuums are able
to complete.
vacuum foo;
vacuum foo_geo;

I have as well created a CF entry for this set of patches:
https://commitfest.postgresql.org/13/971/

Thanks,
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 29bc5cea79..d04298add5 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 blbuildempty(Relation index)
 {
-	Page		metapage;
+	Buffer		MetaBuffer;
 
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	BloomFillMetapage(index, metapage);
+	/* An empty bloom index has one meta page */
+	MetaBuffer =
+		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * Write the page and log it.  It might seem that an immediate sync
-	 * would be sufficient to guarantee that the file exists on disk, but
-	 * recovery itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
-	 * need this even when wal_level=minimal.
-	 */
-	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
-			  (char *) metapage, true);
-	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-BLOOM_METAPAGE_BLKNO, metapage, false);
+	/* Initialize the meta buffer */
+	BloomFillMetapage(index, BufferGetPage(MetaBuffer));
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	/* And log it */
+	START_CRIT_SECTION();
+	MarkBufferDirty(MetaBuffer);
+	log_newpage_buffer(MetaBuffer, false);
+	END_CRIT_SECTION();
+
+	/* Unlock and release the buffer */
+	UnlockReleaseBuffer(MetaBuffer);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acfea6..215d3c6c02 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -237,31 +238,22 @@ btbuildCallback(Relation index,
 void
 btbuildempty(Relation index)
 {
-	Page		metapage;
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
-
-	/*
-	 * Write the page and log it.  It might seem that an immediate sync
-	 * would be sufficient to guarantee that the file exists on disk, but
-	 * recovery itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
-	 * need this even when wal_level=minimal.
-	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
-	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-BTREE_METAPAGE, metapage, false);
-
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	Buffer		metabuffer;
+
+	/* An empty btree index has one meta page */

Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-01-25 Thread Michael Paquier
(Adding Robert in CC.)

On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao  wrote:
> An unlogged table has an initialization fork. The initialization fork does
> not have an BM_PERMANENT flag when get a buffer.
> In checkpoint (not shutdown or end of recovery), it will not write to disk.
> after a crash recovery, the page of initialization fork will not correctly,
> then make the main fork not correctly too.

For init forks the flush need absolutely to happen, so that's really
not good. We ought to fix BufferAlloc() appropriately here.

> Here is an example for GIN index.
>
> create unlogged table gin_test_tbl(i int4[]);
> create index gin_test_idx on gin_test_tbl using gin (i);
> checkpoint;
>
> kill all the postgres process, and restart again.
>
> vacuum gin_test_tbl;  -- crash.
>
> It seems have same problem in BRIN, GIN, GiST and HASH index which using
> buffer for meta page initialize in ambuildempty function.

Yeah, other index AMs deal directly with the sync of the page, that's
why there is no issue for them.

So the patch attached fixes the problem by changing BufferAlloc() in
such a way that initialization forks are permanently written to disk,
which is what you are suggesting. As a simple fix for back-branches
that's enough, though on HEAD I think that we should really rework the
empty() routines so as the write goes through shared buffers first,
that seems more solid than relying on the sgmr routines to do this
work. Robert, what do you think?
-- 
Michael


unlogged-flush-fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-01-25 Thread Wang Hao
An unlogged table has an initialization fork. The initialization fork does
not have an BM_PERMANENT flag when get a buffer.
In checkpoint (not shutdown or end of recovery), it will not write to disk.
after a crash recovery, the page of initialization fork will not correctly,
then make the main fork not correctly too.

Here is an example for GIN index.

create unlogged table gin_test_tbl(i int4[]);
create index gin_test_idx on gin_test_tbl using gin (i);
checkpoint;

kill all the postgres process, and restart again.

vacuum gin_test_tbl;  -- crash.

It seems have same problem in BRIN, GIN, GiST and HASH index which using
buffer for meta page initialize in ambuildempty function.