subject:"Assert in heapgettup_pagemode\(\) fails due to underlying buffer change"

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-09 Thread Thomas Munro

I adjusted the code a bit more to look like the 16 coding including
restoring some very useful comments that had been lost, and pushed.

Thanks very much to Alexander and Noah for (independently) chasing
this down and reporting, testing etc, and Alvaro and Robert for review
comments.

(Passing thought: it's a bit weird that we need to zero pages at all
before restoring FPIs or initialising them.  Perhaps there is some way
to defer marking the buffer valid until after the caller gets a chance
to initialise?  Or something like that...)

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-08 Thread Thomas Munro

On Fri, Jun 7, 2024 at 8:05 PM Alvaro Herrera  wrote:
> In passing, I noticed that WaitReadBuffers has zero comments, which
> seems an insufficient number of them.

Ack.  Here is a patch for that.  I guess I hadn't put a comment there
because it's hard to write anything without sort of duplicating what
is already said by the StartReadBuffers() comment and doubling down on
descriptions of future plans... but, in for a penny, in for a pound as
they say.
From db5e56825e4b4c595885373c413011c50fedd3e8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 9 Jun 2024 09:51:02 +1200
Subject: [PATCH] Add comment for WaitReadBuffers().

Per complaint from Alvaro while reviewing something else.

Reported-by: Alvaro Herrera 
Discussion: https://postgr.es/m/202406070805.icofqromovvn%40alvherre.pgsql
---
 src/backend/storage/buffer/bufmgr.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a8e3377263f..409210a6ed2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1371,6 +1371,17 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
 		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
+/*
+ * The second step of a StartReadBuffers(), WaitReadBuffers() sequence.  It is
+ * only necessary to call this if StartReadBuffers() returned true, indicating
+ * that I/O was necessary.
+ *
+ * Currently, this function performs synchronous I/O system calls.  The earlier
+ * StartReadBuffers() call might have issued advice to give the OS a head
+ * start so that it completes quickly.  In future work, a true asynchronous I/O
+ * could be started by StartReadBuffers(), and this function would wait for it
+ * to complete.
+ */
 void
 WaitReadBuffers(ReadBuffersOperation *operation)
 {
-- 
2.45.1

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-08 Thread Thomas Munro

New version.  Same code as v2, but comments improved to explain the
reasoning, with reference to README's buffer access rules.  I'm
planning to push this soon if there are no objections.
From 1fa26f407622cd69d82f3b4ea68fddf2c357cf06 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH v3] Fix RBM_ZERO_AND_LOCK.

Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool.  It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.  This is done with the standard
BM_IO_IN_PROGRESS infrastructure to test for BM_VALID and wake
concurrent waiters without races (as it was in earlier releases).

Reported-by: Noah Misch 
Reported-by: Alexander Lakhin 
Reviewed-by: Alvaro Herrera  (earlier version)
Reviewed-by: Robert Haas  (earlier version)
Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 64 -
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..a8e3377263f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1010,43 +1010,69 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 }
 
 /*
- * Zero a buffer and lock it, as part of the implementation of
+ * Lock and optionally zero a buffer, as part of the implementation of
  * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
- * pinned.  It does not have to be valid, but it is valid and locked on
- * return.
+ * pinned.  If the buffer is not already valid, it is zeroed and made valid.
  */
 static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 {
 	BufferDesc *bufHdr;
-	uint32		buf_state;
+	bool		need_to_zero;
 
 	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 
-	if (BufferIsLocal(buffer))
+	if (already_valid)
+	{
+		/*
+		 * If the caller already knew the buffer was valid, we can skip some
+		 * header interaction.  The caller just wants to lock the buffer.
+		 */
+		need_to_zero = false;
+	}
+	else if (BufferIsLocal(buffer))
+	{
+		/* Simple case for non-shared buffers. */
 		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+		need_to_zero = (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
 	else
 	{
+		/*
+		 * Take BM_IO_IN_PROGRESS, or discover that BM_VALID has been set
+		 * concurrently.  Even though we aren't doing I/O, that protocol
+		 * ensures that we don't zero a page that someone else has pinned.  An
+		 * exclusive content lock wouldn't be enough, because readers are
+		 * allowed to drop the content lock after determining that a tuple is
+		 * visible (see buffer access rules in README).
+		 */
 		bufHdr = GetBufferDescriptor(buffer - 1);
+		need_to_zero = StartBufferIO(bufHdr, true, false);
+	}
+
+	if (need_to_zero)
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	/* Acquire the requested lock before setting BM_VALID. */
+	if (!BufferIsLocal(buffer))
+	{
 		if (mode == RBM_ZERO_AND_LOCK)
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		else
 			LockBufferForCleanup(buffer);
 	}
 
-	memset(BufferGetPage(buffer), 0, BLCKSZ);
-
-	if (BufferIsLocal(buffer))
-	{
-		buf_state = pg_atomic_read_u32(&bufHdr->state);
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
+	/*
+	 * Set BM_VALID, and wake anyone waiting for BM_IO_IN_PROGRESS to be
+	 * cleared.
+	 */
+	if (need_to_zero)
 	{
-		buf_state = LockBufHdr(bufHdr);
-		buf_state |= BM_VALID;
-		UnlockBufHdr(bufHdr, buf_state);
+		if (BufferIsLocal(buffer))
+			pg_atomic_write_u32(&bufHdr->state,
+pg_atomic_read_u32(&bufHdr->state) | BM_VALID);
+		else
+			TerminateBufferIO(bufHdr, false, BM_VALID, true);
 	}
 }
 
@@ -1185,7 +1211,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 
 		buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
    forkNum, blockNum, strategy, &found);
-		ZeroBuffer(buffer, mode);
+		ZeroAndLockBuffer(buffer, mode, found);
 		return buffer;
 	}
 
-- 
2.45.1

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Thomas Munro

On Sat, Jun 8, 2024 at 12:47 AM Robert Haas  wrote:
>
> On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera  wrote:
> > >  static void
> > > -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> > > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
> >
> > This change makes the API very strange.  Should the function be called
> > ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
> > argument makes a lot more sense.
>
> I agree that's better, but it still looks a bit weird. You have to
> realize that 'bool zero' means 'is already zeroed' here -- or at
> least, I guess that's the intention. But then I wonder why you'd call
> a function called ZeroAndLockBuffer if all you need to do is
> LockBuffer.

The name weirdness comes directly from RBM_ZERO_AND_LOCK (the fact
that it doesn't always zero despite shouting ZERO is probably what
temporarily confused me).  But coming up with a better name is hard
and I certainly don't propose to change it now.  I think it's
reasonable for this internal helper function to have that matching
name as Alvaro suggested, with a good comment about that.

Even though that quick-demonstration change fixed the two reported
repros, I think it is still probably racy (or if it isn't, it relies
on higher level interlocking that I don't want to rely on).  This case
really should be using the standard StartBufferIO/TerminateBufferIO
infrastructure as it was before.  I had moved that around to deal with
multi-block I/O, but dropped the ball on the zero case... sorry.

Here's a version like that.  The "zero" argument (yeah that was not a
good name) is now inverted and called "already_valid", but it's only a
sort of optimisation for the case where we already know for sure that
it's valid.  If it isn't, we do the standard
BM_IO_IN_PROGRESS/BM_VALID/CV dance, for correct interaction with any
concurrent read or zero operation.

v2-0001-Fix-RBM_ZERO_AND_LOCK.patch
Description: Binary data

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Robert Haas

On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera  wrote:
> >  static void
> > -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
>
> This change makes the API very strange.  Should the function be called
> ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
> argument makes a lot more sense.

I agree that's better, but it still looks a bit weird. You have to
realize that 'bool zero' means 'is already zeroed' here -- or at
least, I guess that's the intention. But then I wonder why you'd call
a function called ZeroAndLockBuffer if all you need to do is
LockBuffer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Alexander Lakhin


Hello Thomas,

07.06.2024 09:06, Thomas Munro wrote:

On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro  wrote:

On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin  wrote:

My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit

Author: Thomas Munro 
Date:   Wed Apr 3 00:03:08 2024 +1300

  Provide vectored variant of ReadBuffer().

Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08

Looking...

What Noah described[1] is what should be happening already, I think,
but 210622c6 unconditionally zeroed the page.  Oops.  The attached
seems to cure his repro for me.  Does it also cure your test?  I
couldn't see that variant myself for some reason, but it seems to make
sense as the explanation.  I would probably adjust the function name
or perhaps consider refactoring slightly, but first let's confirm that
this is the same issue and fix.


Thank you for looking and for the fix!

Using the same testing procedure (applying patch for checking lpp,
multiplying 026_overwrite_contrecord.pl tests and running 30 tests in
parallel, with fsync=on) which I used for bisecting, I got failures on
iterations 8, 19, 4 without the fix, but with the fix applied, 125
iterations passed. I think The Cure is sound.

Best regards,
Alexander

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Alvaro Herrera

On 2024-Jun-07, Thomas Munro wrote:

>  static void
> -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)

This change makes the API very strange.  Should the function be called
ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
argument makes a lot more sense.

In passing, I noticed that WaitReadBuffers has zero comments, which
seems an insufficient number of them.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Thomas Munro

On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro  wrote:
> On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin  wrote:
> > My bisect run ended with:
> > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
> >
> > Author: Thomas Munro 
> > Date:   Wed Apr 3 00:03:08 2024 +1300
> >
> >  Provide vectored variant of ReadBuffer().
> >
> > Other buildfarm failures with this Assert I could find kind of confirm this:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
> > (presumably a first failure of this sort)
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08
>
> Looking...

What Noah described[1] is what should be happening already, I think,
but 210622c6 unconditionally zeroed the page.  Oops.  The attached
seems to cure his repro for me.  Does it also cure your test?  I
couldn't see that variant myself for some reason, but it seems to make
sense as the explanation.  I would probably adjust the function name
or perhaps consider refactoring slightly, but first let's confirm that
this is the same issue and fix.

[1] 
https://www.postgresql.org/message-id/flat/20240512171658.7e.nmi...@google.com
From f3bb1d69a57bea820895efaf366371463e62235d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH] Fix RBM_ZERO_AND_LOCK.

Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool.  It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.

Reported-by: Noah Misch 
Reported-by: Alexander Lakhin 
Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..238fc0e3547 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1016,7 +1016,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * return.
  */
 static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
 {
 	BufferDesc *bufHdr;
 	uint32		buf_state;
@@ -1034,7 +1034,8 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 			LockBufferForCleanup(buffer);
 	}
 
-	memset(BufferGetPage(buffer), 0, BLCKSZ);
+	if (zero)
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -1185,7 +1186,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 
 		buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
    forkNum, blockNum, strategy, &found);
-		ZeroBuffer(buffer, mode);
+		ZeroBuffer(buffer, mode, !found);
 		return buffer;
 	}
 
-- 
2.45.1

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Thomas Munro

On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin  wrote:
> My bisect run ended with:
> 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
>
> Author: Thomas Munro 
> Date:   Wed Apr 3 00:03:08 2024 +1300
>
>  Provide vectored variant of ReadBuffer().
>
> Other buildfarm failures with this Assert I could find kind of confirm this:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
> (presumably a first failure of this sort)
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08

Looking...

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Alexander Lakhin


Hello Noah,

06.06.2024 22:07, Noah Misch wrote:



I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.

Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.

$SUBJECT looks more like a duplicate of
postgr.es/m/flat/20240512171658.7e.nmi...@google.com (Hot standby queries see
transient all-zeros pages).


Thank you for the reference! Yes, it looks very similar. Though I can't
say the sleep you proposed helps the failure reproduction (I've tried
026_overwrite_contrecord.pl and saw no more frequent failures or so).

My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit

Author: Thomas Munro 
Date:   Wed Apr 3 00:03:08 2024 +1300

    Provide vectored variant of ReadBuffer().

Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08

Best regards,
Alexander

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Noah Misch

On Thu, Jun 06, 2024 at 12:36:32PM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin  wrote:
> > Am I missing something or the the page buffer indeed lacks locking there?
> 
> I don't know, but if the locks are really missing now, I feel like the
> first question is "which commit got rid of them?". It's a little hard
> to believe that they've never been there and somehow nobody has
> noticed.
> 
> Then again, maybe we have; see Noah's thread about in-place updates
> breaking stuff and some of the surprising discoveries there. But it
> seems worth investigating.

$SUBJECT looks more like a duplicate of
postgr.es/m/flat/20240512171658.7e.nmi...@google.com (Hot standby queries see
transient all-zeros pages).

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Alexander Lakhin


Hello Robert,

06.06.2024 19:36, Robert Haas wrote:

On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin  wrote:

Am I missing something or the the page buffer indeed lacks locking there?

I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.

Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.


Yes, my last experiment with memcmp for the whole buffer was wrong,
given the comment above heapgettup_pagemode(). I think the correct
check would be:
 ItemId  lpp;
 OffsetNumber lineoff;
+ItemIdData  iid;

 lineoff = scan->rs_vistuples[lineindex];
 lpp = PageGetItemId(page, lineoff);
+iid = *((ItemIdData *)lpp);
+for (int i = 0; i < 1000; i++)
+Assert(memcmp(&iid, lpp, sizeof(iid)) == 0);

It significantly alleviates reproducing of the test failure for me.
Will try to bisect this anomaly tomorrow.

Best regards,
Alexander

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Robert Haas

On Thu, Jun 6, 2024 at 6:00 AM Alexander Lakhin  wrote:
> Am I missing something or the the page buffer indeed lacks locking there?

I don't know, but if the locks are really missing now, I feel like the
first question is "which commit got rid of them?". It's a little hard
to believe that they've never been there and somehow nobody has
noticed.

Then again, maybe we have; see Noah's thread about in-place updates
breaking stuff and some of the surprising discoveries there. But it
seems worth investigating.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Alexander Lakhin


Hello hackers,

I tried to investigate a recent buildfarm test failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-06-04%2003%3A27%3A47

 29/295 postgresql:recovery / recovery/026_overwrite_contrecord ERROR   
 39.55s   exit status 32

log/026_overwrite_contrecord_standby.log
TRAP: failed Assert("ItemIdIsNormal(lpp)"), File: 
"../pgsql/src/backend/access/heap/heapam.c", Line: 1002, PID: 3740958
postgres: standby: bf postgres [local] 
startup(ExceptionalCondition+0x81)[0x56c60bf9]
postgres: standby: bf postgres [local] startup(+0xf776e)[0x5667276e]
postgres: standby: bf postgres [local] 
startup(heap_getnextslot+0x40)[0x56672ee1]
postgres: standby: bf postgres [local] startup(+0x11c218)[0x56697218]
postgres: standby: bf postgres [local] 
startup(systable_getnext+0xfa)[0x56697c1a]
postgres: standby: bf postgres [local] startup(+0x6d29c7)[0x56c4d9c7]
postgres: standby: bf postgres [local] startup(+0x6d372c)[0x56c4e72c]
postgres: standby: bf postgres [local] startup(+0x6d8288)[0x56c53288]
postgres: standby: bf postgres [local] 
startup(RelationCacheInitializePhase3+0x149)[0x56c52d71]

(It's not the only failure of that ilk in the buildfarm.)

and managed to reproduce the failure by running many
026_overwrite_contrecord tests in parallel (with fsync=on).

Analyzing the core dump added some info:
...
#3  0x00bb43cc in ExceptionalCondition (conditionName=0xc45c77 
"ItemIdIsNormal(lpp)",
    fileName=0xc45aa8 "heapam.c", lineNumber=1002) at assert.c:66
#4  0x004f7f13 in heapgettup_pagemode (scan=0x19f5660, 
dir=ForwardScanDirection, nkeys=2, key=0x19f61d0)
    at heapam.c:1002
#5  0x004f86d1 in heap_getnextslot (sscan=0x19f5660, 
direction=ForwardScanDirection, slot=0x19f5da0)
    at heapam.c:1307
#6  0x0051d028 in table_scan_getnextslot (sscan=0x19f5660, 
direction=ForwardScanDirection, slot=0x19f5da0)
    at ../../../../src/include/access/tableam.h:1081
#7  0x0051da80 in systable_getnext (sysscan=0x19f5470) at genam.c:530
#8  0x00ba0937 in RelationBuildTupleDesc (relation=0x7fa004feea88) at 
relcache.c:572
#9  0x00ba17b9 in RelationBuildDesc (targetRelId=2679, insertIt=true) 
at relcache.c:1184
#10 0x00ba6520 in load_critical_index (indexoid=2679, heapoid=2610) at 
relcache.c:4353
#11 0x00ba607d in RelationCacheInitializePhase3 () at relcache.c:4132
#12 0x00bcb704 in InitPostgres (in_dbname=0x196ca30 "postgres", dboid=5, 
username=0x19a91b8 "law", useroid=0,
    flags=1, out_dbname=0x0) at postinit.c:1193
...
(gdb) frame 4
(gdb) p lpp->lp_flags
$2 = 1
(gdb) p ItemIdIsNormal(lpp)
$12 = 1

So it looks like the Assert had failed when lpp->lp_flags had some other
contents...

I added the following debugging code:
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -995,10 +995,14 @@ continue_page:
    for (; linesleft > 0; linesleft--, lineindex += dir)
    {
    ItemId  lpp;
+   ItemIdData  iid;
    OffsetNumber lineoff;

    lineoff = scan->rs_vistuples[lineindex];
    lpp = PageGetItemId(page, lineoff);
+   iid = *((ItemIdData *)lpp);
+
+   Assert(ItemIdIsNormal(&iid));
    Assert(ItemIdIsNormal(lpp));

and got:
...
#2  0x55b68dc6998c in ExceptionalCondition (conditionName=0x55b68dcfe5f7 
"ItemIdIsNormal(&iid)",
    fileName=0x55b68dcfe428 "heapam.c", lineNumber=1010) at assert.c:66
#3  0x55b68d588a78 in heapgettup_pagemode (scan=0x55b68f0905e0, 
dir=ForwardScanDirection, nkeys=2,
    key=0x55b68f091150) at heapam.c:1010
#4  0x55b68d58930e in heap_getnextslot (sscan=0x55b68f0905e0, 
direction=ForwardScanDirection, slot=0x55b68f090d20)
    at heapam.c:1322
...
(gdb) frame 3
#3  0x55b68d588a78 in heapgettup_pagemode (...) at heapam.c:1010
1010    Assert(ItemIdIsNormal(&iid));
(gdb) info locals
lpp = 0x7f615c34b0ec
iid = {lp_off = 0, lp_flags = 0, lp_len = 0}
lineoff = 54
tuple = 0x55b68f090638
page = 0x7f615c34b000 ""

(gdb) p *lpp
$1 = {lp_off = 3160, lp_flags = 1, lp_len = 136}

It seemingly confirms that the underlying memory was changed while being
processed in heapgettup_pagemode().

I've tried to add checks for the page buffer content as below:
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -953,11 +953,15 @@ heapgettup_pagemode(HeapScanDesc scan,
 Page    page;
 int lineindex;
 int linesleft;
+char    page_copy[BLCKSZ];

 if (likely(scan->rs_inited))
 {
 /* continue from previously returned page/tuple */
 page = BufferGetPage(scan->rs_cbuf);
+memcpy(page_copy, page, BLCKSZ);
+for (int i = 0; i < 100; i++)
+Assert(memcmp(page_copy, page, BLCKSZ) == 0);

 lineindex = scan->rs_cindex + dir;
 if (ScanDirectionIsForward(dir))
@@ -986,6

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

Assert in heapgettup_pagemode() fails due to underlying buffer change

14 matches

Site Navigation

Mail list logo

Footer information