Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-17 Thread Robert Haas
On Fri, May 17, 2024 at 10:56 AM Jeff Davis  wrote:
> I'm still not entirely clear on why hash indexes can't just follow the
> rules and exclusive lock the buffer and dirty it. Presumably
> performance would suffer, but I asked that question previously and
> didn't get an answer:
>
> https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com

In my defense, the last time I worked on hash indexes was 7 years ago.
If this question had come up within a year or two of that work, I
probably would have both (a) had a much clearer idea of what the
answer was and (b) felt obliged to drop everything and go research it
if I didn't. But at this point, I feel like it's fair for me to tell
you what I know and leave it to you to do further research if you feel
like that's warranted. I know that we're each responsible for what we
commit, but I don't really think that should extend to having to
prioritize answering a hypothetical question ("what would happen if X
thing worked like Y instead of the way it does?") about an area I
haven't touched in long enough for every release that doesn't contain
those commits to be out of support. If you feel otherwise, let's have
that argument, but I have a feeling that it may be more that you're
hoping I have some kind of oracular powers which, in reality, I lack.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-17 Thread Jeff Davis
On Fri, 2024-05-17 at 10:12 +0900, Michael Paquier wrote:
> That's something I've done four weeks ago in the hash replay code
> path, and having the image with XLR_CHECK_CONSISTENCY even if
> REGBUF_NO_CHANGE was necessary because replay was setting up a LSN on
> a REGBUF_NO_CHANGE page it should not have touched.

Then the candidate fix to selectively break XLR_CHECK_CONSISTENCY is
not acceptable.

> 
> Yeah, agreed that getting rid of REGBUF_NO_CHANGE would be nice in
> the
> final picture.  It still strikes me as a weird concept that WAL
> replay
> for hash indexes logs full pages just to be able to lock them at
> replay based on what's in the records.  :/

I'm still not entirely clear on why hash indexes can't just follow the
rules and exclusive lock the buffer and dirty it. Presumably
performance would suffer, but I asked that question previously and
didn't get an answer:

https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com

And if that does affect performance, what about following the same
protocol as XLogSaveBufferForHint()?

Regards,
Jeff Davis





Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-16 Thread Michael Paquier
On Thu, May 16, 2024 at 03:54:52PM -0700, Jeff Davis wrote:
> On Mon, 2024-05-13 at 11:15 +1200, Thomas Munro wrote:
>> Perhaps a no-image, no-change registered buffer should not be
>> including an image, even for XLR_CHECK_CONSISTENCY?  It's
>> actually
>> useless for consistency checking too I guess, this issue
>> aside,
>> because it doesn't change anything so there is nothing to
>> check.
> 
> I'm not convinced by that reasoning. Can't it check that nothing has
> changed?

That's something I've done four weeks ago in the hash replay code
path, and having the image with XLR_CHECK_CONSISTENCY even if
REGBUF_NO_CHANGE was necessary because replay was setting up a LSN on
a REGBUF_NO_CHANGE page it should not have touched.

> I'd prefer that we find a way to get rid of REGBUF_NO_CHANGE and make
> all callers follow the rules than to find new special cases that depend
> on REGBUF_NO_CHANGE. See these messages here:
> 
> https://www.postgresql.org/message-id/b1f2d0f230d60fa8df33bb0b2af3236fde9d750d.camel%40j-davis.com
> 
> https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com
> 
> In other words, we added REGBUF_NO_CHANGE for the call sites (only hash
> indexes) that don't follow the rules and where it wasn't easy to make
> them follow the rules. Now that we know of a concrete problem with the
> design, there's more upside to fixing it properly.

Yeah, agreed that getting rid of REGBUF_NO_CHANGE would be nice in the
final picture.  It still strikes me as a weird concept that WAL replay
for hash indexes logs full pages just to be able to lock them at
replay based on what's in the records.  :/
--
Michael


signature.asc
Description: PGP signature


Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-16 Thread Jeff Davis
On Mon, 2024-05-13 at 11:15 +1200, Thomas Munro wrote:
> > > > > Perhaps a no-image, no-change registered buffer should not be
> > > > > including an image, even for XLR_CHECK_CONSISTENCY?  It's
> > > > > actually
> > > > > useless for consistency checking too I guess, this issue
> > > > > aside,
> > > > > because it doesn't change anything so there is nothing to
> > > > > check.

I'm not convinced by that reasoning. Can't it check that nothing has
changed?

> > > 
> > > Does it reproduce if you do this?
> > > 
> > > -   include_image = needs_backup || (info &
> > > XLR_CHECK_CONSISTENCY) != 0;
> > > +   include_image = needs_backup ||
> > > +   ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> > > +    (regbuf->flags & REGBUF_NO_CHANGE) ==
> > > 0);
> > 
> > No, it doesn't (at least with the latter, more targeted
> > reproducer).
> 
> OK so that seems like a candidate fix, but ...

...

> ... we'd need to figure out how to fix this in the back-branches too.
> One idea would be to back-patch REGBUF_NO_CHANGE, and another might
> be
> to deduce that case from other variables.  Let me CC a couple more
> people from this thread, which most recently hacked on this stuff, to
> see if they have insights:


Starting from the beginning, XLogRecordAssemble() calculates the CRC of
the record (technically just the non-header portions, but that's not
important here), including any backup blocks. Later,
CopyXLogRecordToWAL() copies that data into the actual xlog buffers. If
the data changes between those two steps, the CRC will be bad.

For most callers, the contents are exclusive-locked, so there's no
problem. For checksums, the data is copied out of shared memory into a
stack variable first, so no concurrent activity can change it. For hash
indexes, it tries to protect itself by passing REGBUF_NO_IMAGE.

There are two problems:

1. That implies another invariant that we aren't checking for: that
REGBUF_NO_CHANGE must be accompanied by REGBUF_NO_IMAGE. That doesn't
seem to be true for all callers, see XLogRegisterBuffer(1, wbuf,
wbuf_flags) in _hash_freeovflpage().

2. As you point out, REGBUF_NO_IMAGE still allows an image to be taken
if XLR_CHECK_CONSISTENCY is set, so we need to figure out what to do
there.

Can we take a step back and think harder about what hash indexes are
doing and if there's a better way? Maybe hash indexes need to take a
copy of the page, like in XLogSaveBufferForHint()?

I'd prefer that we find a way to get rid of REGBUF_NO_CHANGE and make
all callers follow the rules than to find new special cases that depend
on REGBUF_NO_CHANGE. See these messages here:

https://www.postgresql.org/message-id/b1f2d0f230d60fa8df33bb0b2af3236fde9d750d.camel%40j-davis.com

https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com

In other words, we added REGBUF_NO_CHANGE for the call sites (only hash
indexes) that don't follow the rules and where it wasn't easy to make
them follow the rules. Now that we know of a concrete problem with the
design, there's more upside to fixing it properly.

Regards,
Jeff Davis





Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-12 Thread Thomas Munro
On Sat, May 11, 2024 at 5:00 PM Alexander Lakhin  wrote:
> 11.05.2024 07:25, Thomas Munro wrote:
> > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin  
> > wrote:
> >> 11.05.2024 06:26, Thomas Munro wrote:
> >>> Perhaps a no-image, no-change registered buffer should not be
> >>> including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> >>> useless for consistency checking too I guess, this issue aside,
> >>> because it doesn't change anything so there is nothing to check.

> >> Yes, I think something wrong is here. I've reduced the reproducer to:

> > Does it reproduce if you do this?
> >
> > -   include_image = needs_backup || (info &
> > XLR_CHECK_CONSISTENCY) != 0;
> > +   include_image = needs_backup ||
> > +   ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> > +(regbuf->flags & REGBUF_NO_CHANGE) == 0);
>
> No, it doesn't (at least with the latter, more targeted reproducer).

OK so that seems like a candidate fix, but ...

> > Unfortunately the back branches don't have that new flag from 00d7fb5e
> > so, even if this is the right direction (not sure, I don't understand
> > this clean registered buffer trick) then ... but wait, why are there
> > are no failures like this in the back branches (yet at least)?  Does
> > your reproducer work for 16?  I wonder if something relevant changed
> > recently, like f56a9def.  CC'ing Michael and Amit K for info.
>
> Maybe it's hard to hit (autovacuum needs to process the index page in a
> narrow time frame), but locally I could reproduce the issue even on
> ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching
> hash indexes, didn't dig deeper).

... we'd need to figure out how to fix this in the back-branches too.
One idea would be to back-patch REGBUF_NO_CHANGE, and another might be
to deduce that case from other variables.  Let me CC a couple more
people from this thread, which most recently hacked on this stuff, to
see if they have insights:

https://www.postgresql.org/message-id/flat/d2c31606e6bb9b83a02ed4835d65191b38d4ba12.camel%40j-davis.com




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Alexander Lakhin

11.05.2024 07:25, Thomas Munro wrote:

On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin  wrote:

11.05.2024 06:26, Thomas Munro wrote:

Perhaps a no-image, no-change registered buffer should not be
including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
useless for consistency checking too I guess, this issue aside,
because it doesn't change anything so there is nothing to check.

Yes, I think something wrong is here. I've reduced the reproducer to:

Does it reproduce if you do this?

-   include_image = needs_backup || (info &
XLR_CHECK_CONSISTENCY) != 0;
+   include_image = needs_backup ||
+   ((info & XLR_CHECK_CONSISTENCY) != 0 &&
+(regbuf->flags & REGBUF_NO_CHANGE) == 0);


No, it doesn't (at least with the latter, more targeted reproducer).


Unfortunately the back branches don't have that new flag from 00d7fb5e
so, even if this is the right direction (not sure, I don't understand
this clean registered buffer trick) then ... but wait, why are there
are no failures like this in the back branches (yet at least)?  Does
your reproducer work for 16?  I wonder if something relevant changed
recently, like f56a9def.  CC'ing Michael and Amit K for info.


Maybe it's hard to hit (autovacuum needs to process the index page in a
narrow time frame), but locally I could reproduce the issue even on
ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching
hash indexes, didn't dig deeper).

Best regards,
Alexander




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Thomas Munro
On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin  wrote:
> 11.05.2024 06:26, Thomas Munro wrote:
> > Perhaps a no-image, no-change registered buffer should not be
> > including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> > useless for consistency checking too I guess, this issue aside,
> > because it doesn't change anything so there is nothing to check.
>
> Yes, I think something wrong is here. I've reduced the reproducer to:

Does it reproduce if you do this?

-   include_image = needs_backup || (info &
XLR_CHECK_CONSISTENCY) != 0;
+   include_image = needs_backup ||
+   ((info & XLR_CHECK_CONSISTENCY) != 0 &&
+(regbuf->flags & REGBUF_NO_CHANGE) == 0);

Unfortunately the back branches don't have that new flag from 00d7fb5e
so, even if this is the right direction (not sure, I don't understand
this clean registered buffer trick) then ... but wait, why are there
are no failures like this in the back branches (yet at least)?  Does
your reproducer work for 16?  I wonder if something relevant changed
recently, like f56a9def.  CC'ing Michael and Amit K for info.




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Alexander Lakhin

Hello Thomas and Andres,

Thank you for looking at this!

11.05.2024 06:26, Thomas Munro wrote:

On Sat, May 11, 2024 at 3:57 AM Andres Freund  wrote:

On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:

and discovered that XLogRecordAssemble() calculates CRC over a buffer,
that might be modified by another process.

If, with "might", you mean that it's legitimate for that buffer to be
modified, I don't think so.  The bug is that something is modifying the buffer
despite it being exclusively locked.

I.e. what we likely have here is a bug somewhere in the hash index code.

I don't have a good grip on the higher level locking protocols of
hash.c, but one microscopic thing jumps out:

 /*
  * bucket buffer was not changed, but still needs to be
  * registered to ensure that we can acquire a cleanup lock on
  * it during replay.
  */
 if (!xlrec.is_primary_bucket_page)
 {
 uint8flags = REGBUF_STANDARD |
REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;

 XLogRegisterBuffer(0, bucket_buf, flags);
 }

That registers a buffer that is pinned but not content-locked, and we
tell xloginsert.c not to copy its image into the WAL, but it does it
anyway because:

 /*
  * If needs_backup is true or WAL checking is enabled for current
  * resource manager, log a full-page write for the current block.
  */
 include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;

So I guess it copies the image on dodo, which has:

 'PG_TEST_EXTRA' => 'ssl ldap
kerberos wal_consistency_checking libpq_encryption xid_wraparound'

Perhaps a no-image, no-change registered buffer should not be
including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
useless for consistency checking too I guess, this issue aside,
because it doesn't change anything so there is nothing to check.


Yes, I think something wrong is here. I've reduced the reproducer to:
cat << 'EOF' | psql
CREATE TABLE hash_cleanup_heap(keycol INT);
CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol);

BEGIN;
INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
ROLLBACK;
EOF

cat << 'EOF' | psql &
INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;

DROP TABLE hash_cleanup_heap;
EOF

cat << 'EOF' | psql &
SELECT pg_sleep(random() / 20);
VACUUM hash_cleanup_heap;
EOF
wait
grep 'TRAP:' server.log

("wal_consistency_checking = all" and the xloginsert patch are still required)

and with additional logging I see:
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|LOG: 
!!!hashbucketcleanup| scan page buf: 1832
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|CONTEXT: while vacuuming index "hash_cleanup_index" of 
relation "public.hash_cleanup_heap"

2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|STATEMENT:  VACUUM 
hash_cleanup_heap;
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|LOG: 
!!!_hash_doinsert| _hash_getbucketbuf_from_hashkey: 1822
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|STATEMENT:  INSERT INTO hash_cleanup_heap SELECT 1 FROM 
generate_series(1, 500) as i;
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|LOG: !!!hashbucketcleanup| xlrec.is_primary_bucket_page: 0, 
buf: 1832, bucket_buf: 1822
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|CONTEXT: while vacuuming index "hash_cleanup_index" of 
relation "public.hash_cleanup_heap"

2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|STATEMENT:  VACUUM 
hash_cleanup_heap;
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|LOG: 
!!!_hash_doinsert| _hash_relbuf(rel, 1822)
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|STATEMENT:  INSERT INTO hash_cleanup_heap SELECT 1 FROM 
generate_series(1, 500) as i;

TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: 
"xloginsert.c", Line: 949, PID: 2070237

Best regards,
Alexander




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Thomas Munro
On Sat, May 11, 2024 at 3:57 AM Andres Freund  wrote:
> On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:
> > and discovered that XLogRecordAssemble() calculates CRC over a buffer,
> > that might be modified by another process.
>
> If, with "might", you mean that it's legitimate for that buffer to be
> modified, I don't think so.  The bug is that something is modifying the buffer
> despite it being exclusively locked.
>
> I.e. what we likely have here is a bug somewhere in the hash index code.

I don't have a good grip on the higher level locking protocols of
hash.c, but one microscopic thing jumps out:

/*
 * bucket buffer was not changed, but still needs to be
 * registered to ensure that we can acquire a cleanup lock on
 * it during replay.
 */
if (!xlrec.is_primary_bucket_page)
{
uint8flags = REGBUF_STANDARD |
REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;

XLogRegisterBuffer(0, bucket_buf, flags);
}

That registers a buffer that is pinned but not content-locked, and we
tell xloginsert.c not to copy its image into the WAL, but it does it
anyway because:

/*
 * If needs_backup is true or WAL checking is enabled for current
 * resource manager, log a full-page write for the current block.
 */
include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;

So I guess it copies the image on dodo, which has:

'PG_TEST_EXTRA' => 'ssl ldap
kerberos wal_consistency_checking libpq_encryption xid_wraparound'

Perhaps a no-image, no-change registered buffer should not be
including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
useless for consistency checking too I guess, this issue aside,
because it doesn't change anything so there is nothing to check.




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Andres Freund
Hi,

On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:
> and discovered that XLogRecordAssemble() calculates CRC over a buffer,
> that might be modified by another process.

If, with "might", you mean that it's legitimate for that buffer to be
modified, I don't think so.  The bug is that something is modifying the buffer
despite it being exclusively locked.

I.e. what we likely have here is a bug somewhere in the hash index code.

Greetings,

Andres Freund