Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-09 Thread Michael Paquier
On Sat, Apr 08, 2023 at 04:24:35PM +0200, Matthias van de Meent wrote:
> Thanks a lot! I'll post the separation of record construction and
> write-out to xlog in a future thread for 17.

Thanks!  Creating a new thread makes sense.

> One remaining question: Considering that the changes and checks of
> that commit are mostly internal to xloginsert.c (or xlog.c in older
> releases), and that no special public-facing changes were made, would
> it be safe to backport this to older releases?

The routine changes done in ffd1b6b cannot be backpatched on ABI
grounds, still you would propose to have protection around
needs_data as well as the whole record length.

> PostgreSQL 15 specifically would benefit from this as it supports
> external rmgrs which may generate WAL records and would benefit from
> these additional checks, but all supported releases of PostgreSQL have
> pg_logical_emit_message and are thus easily subject to the issue of
> writing oversized WAL records and subsequent recovery- and replication
> stream failures.

Custom RMGRs are a good argument, though I don't really see an urgent
argument about doing something in REL_15_STABLE.  For one, it would
mean more backpatching conflicts with ~14.  Another argument is that
XLogRecordMaxSize is not an exact science, either.  In ~15, a record
with a total size between XLogRecordMaxSize and
DecodeXLogRecordRequiredSpace(MaxAllocSize) would work, though it
would not in 16~ because we have the 4MB margin given as room for the
per-record allocation in the XLogReader.  A record of such a size
would not be generated anymore after a minor release update of 15.3~
if we were to do something about that by May on REL_15_STABLE.
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-08 Thread Matthias van de Meent
On Fri, 7 Apr 2023 at 08:05, Michael Paquier  wrote:
>
> On Fri, Apr 07, 2023 at 08:59:22AM +0900, Michael Paquier wrote:
> > Okay, cool!
>
> Done this one with 8fcb32d.

Thanks a lot! I'll post the separation of record construction and
write-out to xlog in a future thread for 17.

One remaining question: Considering that the changes and checks of
that commit are mostly internal to xloginsert.c (or xlog.c in older
releases), and that no special public-facing changes were made, would
it be safe to backport this to older releases?

PostgreSQL 15 specifically would benefit from this as it supports
external rmgrs which may generate WAL records and would benefit from
these additional checks, but all supported releases of PostgreSQL have
pg_logical_emit_message and are thus easily subject to the issue of
writing oversized WAL records and subsequent recovery- and replication
stream failures.

Kind regards,

Matthias van de Meent




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-07 Thread Michael Paquier
On Fri, Apr 07, 2023 at 08:59:22AM +0900, Michael Paquier wrote:
> Okay, cool!

Done this one with 8fcb32d.
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-06 Thread Michael Paquier
On Fri, Apr 07, 2023 at 01:50:00AM +0200, Matthias van de Meent wrote:
> Yes, that was a bad oversight, which would've shown up in tests on a system
> with an endianness that my computer doesn't have...

I don't think that we have many bigendian animals in the buildfarm,
either.. 

> That looks fine to me. Thanks for picking this up and fixing the issue.

Okay, cool!
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-06 Thread Matthias van de Meent
On Fri, 7 Apr 2023, 01:35 Michael Paquier,  wrote:

> On Fri, Apr 07, 2023 at 08:08:34AM +0900, Michael Paquier wrote:
> > So bumping mainrdata_len to uint64 is actually not entirely in line
> > with this code.  Well, it will work because we'd still fail a couple
> > of lines down, but perhaps its readability should be improved so as
> > we have an extra check in this code path to make sure that
> > mainrdata_len is not higher than PG_UINT32_MAX, then use an
> > intermediate casted variable before saving the length in the record
> > data to make clear that the type of the main static length in
> > xloginsert.c is not the same as what a record has?  The v10 I sent
> > previously blocked this possibility, but not v11.
>

Yes, that was a bad oversight, which would've shown up in tests on a system
with an endianness that my computer doesn't have...


> So, I was thinking about something like the attached tweaking this
> point, the error details a bit, applying an indentation and writing a
> commit message...  Matthias?
>

That looks fine to me. Thanks for picking this up and fixing the issue.



Kind regards,

Matthias van de Meent


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-06 Thread Michael Paquier
On Fri, Apr 07, 2023 at 08:08:34AM +0900, Michael Paquier wrote:
> So bumping mainrdata_len to uint64 is actually not entirely in line
> with this code.  Well, it will work because we'd still fail a couple
> of lines down, but perhaps its readability should be improved so as
> we have an extra check in this code path to make sure that
> mainrdata_len is not higher than PG_UINT32_MAX, then use an
> intermediate casted variable before saving the length in the record
> data to make clear that the type of the main static length in
> xloginsert.c is not the same as what a record has?  The v10 I sent
> previously blocked this possibility, but not v11.

So, I was thinking about something like the attached tweaking this
point, the error details a bit, applying an indentation and writing a
commit message...  Matthias?
--
Michael
From 10eff110176a4988295118b0f17a2a868fb9b96e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 7 Apr 2023 08:19:24 +0900
Subject: [PATCH v12] Add more protections in WAL record APIs against overflows

This commit adds a limit to the size of an XLogRecord at 1020MiB, based
on a suggestion by Heikki Linnakangas.  This counts for the overhead
needed by the xlogreader when allocating the memory it needs to read a
record in DecodeXLogRecordRequiredSpace().  An assertion based on that
is added to make sure to detect that any additions in the XLogReader
facilities would not cause any overflows.  If that's ever the case, the
upper bound would need to be adjusted.

Before this, it was possible for an external module to create WAL
records large enough to be assembled still not replayable, causing
failures when replaying such WAL records on standbys.  One case
mentioned where this is possible is the in-core function
pg_logical_emit_message(), that allows to emit WAL records with an
arbitrary amount of data up to 2GB, much higher than the replay limit
of approximately 1GB (limit of a palloc).

This commit is a follow-up of ffd1b6b that has added similar protections
for the block-level data.  Here, the checks are extended to the whole
record length.  All the error messages related to overflow checks are
improved to provide more context about what is happening.

Author: Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n...@mail.gmail.com
---
 src/include/access/xlogrecord.h | 11 +
 src/backend/access/transam/xloginsert.c | 60 ++---
 2 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 0d576f7883..f355e08e1d 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -62,6 +62,17 @@ typedef struct XLogRecord
 #define XLR_INFO_MASK			0x0F
 #define XLR_RMGR_INFO_MASK		0xF0
 
+/*
+ * XLogReader needs to allocate all the data of a WAL record in a single
+ * chunk.  This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, this value allows for 4M of allocation
+ * overhead, that should be plenty enough for what
+ * DecodeXLogRecordRequiredSpace() expects as extra.
+ */
+#define XLogRecordMaxSize	(1020 * 1024 * 1024)
+
 /*
  * If a WAL record modifies any relation files, in ways not covered by the
  * usual block references, this flag is set. This is not used for anything
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 008612e032..b71e690c5b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -98,7 +98,7 @@ static int	max_registered_block_id = 0;	/* highest block_id + 1 currently
  */
 static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) _head;
-static uint32 mainrdata_len;	/* total # of bytes in chain */
+static uint64 mainrdata_len;	/* total # of bytes in chain */
 
 /* flags for the in-progress insertion */
 static uint8 curinsert_flags = 0;
@@ -355,7 +355,10 @@ XLogRegisterData(char *data, uint32 len)
 	Assert(begininsert_called);
 
 	if (num_rdatas >= max_rdatas)
-		elog(ERROR, "too much WAL data");
+		ereport(ERROR,
+(errmsg_internal("too much WAL data"),
+ errdetail_internal("%u out of %u data segments are already in use.",
+	num_rdatas, max_rdatas)));
 	rdata = [num_rdatas++];
 
 	rdata->data = data;
@@ -405,9 +408,16 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
 	 * regbuf->rdata_len does not grow beyond what
 	 * XLogRecordBlockHeader->data_length can hold.
 	 */
-	if (num_rdatas >= max_rdatas ||
-		regbuf->rdata_len + len > UINT16_MAX)
-		elog(ERROR, "too much WAL data");
+	if (num_rdatas >= max_rdatas)
+		ereport(ERROR,
+(errmsg_internal("too much WAL data"),
+ errdetail_internal("%u out of %u data segments are already in use.",
+	num_rdatas, max_rdatas)));
+	if (regbuf->rdata_len + len > UINT16_MAX 

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-06 Thread Michael Paquier
On Thu, Apr 06, 2023 at 10:54:43AM +0900, Michael Paquier wrote:
> 0002 can also be done before 0001, so I'd like to get that part
> applied on HEAD before the feature freeze and close this thread.  If
> there are any objections, please feel free..

I was doing a pre-commit review of the patch, and double-checked the
uses of mainrdata_len.   And there is this part:
/* followed by main data, if any */
if (mainrdata_len > 0)
{
if (mainrdata_len > 255)
{
*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
memcpy(scratch, _len, sizeof(uint32));
scratch += sizeof(uint32);
}
else
{
*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
*(scratch++) = (uint8) mainrdata_len;
}
rdt_datas_last->next = mainrdata_head;
rdt_datas_last = mainrdata_last;
total_len += mainrdata_len;
}
rdt_datas_last->next = NULL;

So bumping mainrdata_len to uint64 is actually not entirely in line
with this code.  Well, it will work because we'd still fail a couple
of lines down, but perhaps its readability should be improved so as
we have an extra check in this code path to make sure that
mainrdata_len is not higher than PG_UINT32_MAX, then use an
intermediate casted variable before saving the length in the record
data to make clear that the type of the main static length in
xloginsert.c is not the same as what a record has?  The v10 I sent
previously blocked this possibility, but not v11.
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-05 Thread Michael Paquier
On Wed, Apr 05, 2023 at 04:35:37PM +0200, Matthias van de Meent wrote:
> I thought that the plan was to use int64 to skip checking for most
> overflows and to do a single check at the end in XLogRecordAssemble,
> so that the checking has minimal overhead in the performance-critical
> log record assembling path and reduced load on the branch predictor.

And that's the reason why your v11-0002 is better and simpler than the
v10-0001 I posted a few days ago.

+   if (regbuf->rdata_len + len > UINT16_MAX || len > UINT16_MAX)
+   ereport(ERROR,
+   (errmsg_internal("too much WAL data"),
+errdetail_internal("Registering more than max %u bytes total 
to block %u: current %uB, adding %uB",
+   UINT16_MAX, block_id, regbuf->rdata_len, 
len)));

I was wondering for a few minutes about the second part of this
check..  But you are worried about the case where len is too large
that it would overflow rdata_len if calling XLogRegisterBufData() more
than once on the same block, if len is between
(UINT32_MAX-UINT16_MAX,UINT32_MAX) on the second call.

The extra errdetail_internal() could be tweaked a bit more, but I'm
also OK with your proposal, overall.  One thing is "current %uB,
adding %uB" would be better using "bytes".

> One more issue that Andres was suggesting we'd fix was to allow XLog
> assembly separate from the actual XLog insertion:
> Currently you can't pre-assemble a record outside a critical section
> if the record must be inserted in a critical section, which makes e.g.
> commit records problematic due to the potentially oversized data
> resulting in ERRORs during record assembly. This would crash postgres
> because commit xlog insertion happens in a critical section. Having a
> pre-assembled record would greatly improve the ergonomics in that path
> and reduce the length of the critical path.
>
> I think it was something along the lines of the attached; 0001
> contains separated Commit/Abort record construction and insertion like
> Andres suggested,

I am honestly not sure whether we should complicate xloginsert.c this
way, but we could look at that for v17.

> 0002 does the size checks with updated error messages.

0002 can also be done before 0001, so I'd like to get that part
applied on HEAD before the feature freeze and close this thread.  If
there are any objections, please feel free..
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-05 Thread Matthias van de Meent
On Tue, 28 Mar 2023 at 13:42, Michael Paquier  wrote:
>
> On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote:
> > I have created one in the January commitfest,
> > https://commitfest.postgresql.org/41/
> > and rebased the patch on current master.  (I have not reviewed this.)
>
> I have spent some time on that, and here are some comments with an
> updated version of the patch attached.
>
> The checks in XLogRegisterData() seemed overcomplicated to me.  In
> this context, I think that we should just care about making sure that
> mainrdata_len does not overflow depending on the length given by the
> caller, which is where pg_add_u32_overflow() becomes handy.
>
> XLogRegisterBufData() added a check on UINT16_MAX in an assert, though
> we already check for overflow a couple of lines down.  This is not
> necessary, it seems.
>
> @@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
> XLogRecord *rechdr;
> char   *scratch = hdr_scratch;
>
> +   /* ensure that any assembled record can be decoded */
> +   
> Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));
>
> A hardcoded check like that has no need to be in a code path triggered
> each time a WAL record is assembled.  One place where this could be is
> InitXLogInsert().  It still means that it is called one time for each
> backend, but seeing it where the initialization of xloginsert.c feels
> natural, at least.  A postmaster location would be enough, as well.
>
> XLogRecordMaxSize just needs to be checked once IMO, around the end of
> XLogRecordAssemble() once we know the total size of the record that
> will be fed to a XLogReader.  One thing that we should be more careful
> of is to make sure that total_len does not overflow its uint32 value
> while assembling the record, as well.
>
> I have removed XLogErrorDataLimitExceeded(), replacing it with more
> context about the errors happening.  Perhaps this has no need to be
> that much verbose, but it can be really useful for developers.
>
> Some comments had no need to be updated, and there were some typos.
>
> I am on board with the idea of a XLogRecordMaxSize that's bounded at
> 1020MB, leaving 4MB as room for the extra data needed by a
> XLogReader.
>
> At the end, I think that this is quite interesting long-term.  For
> example, if we lift up XLogRecordMaxSize, we can evaluate the APIs
> adding buffer data or main data separately.
>
> Thoughts about this version?

I thought that the plan was to use int64 to skip checking for most
overflows and to do a single check at the end in XLogRecordAssemble,
so that the checking has minimal overhead in the performance-critical
log record assembling path and reduced load on the branch predictor.

One more issue that Andres was suggesting we'd fix was to allow XLog
assembly separate from the actual XLog insertion:
Currently you can't pre-assemble a record outside a critical section
if the record must be inserted in a critical section, which makes e.g.
commit records problematic due to the potentially oversized data
resulting in ERRORs during record assembly. This would crash postgres
because commit xlog insertion happens in a critical section. Having a
pre-assembled record would greatly improve the ergonomics in that path
and reduce the length of the critical path.

I think it was something along the lines of the attached; 0001
contains separated Commit/Abort record construction and insertion like
Andres suggested, 0002 does the size checks with updated error
messages.


Kind regards,

Matthias van de Meent


v11-0001-Create-a-path-for-separate-xlog-record-construct.patch
Description: Binary data


v11-0002-Add-protections-in-xlog-record-APIs-against-over.patch
Description: Binary data


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-03-28 Thread Michael Paquier
On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote:
> I have created one in the January commitfest,
> https://commitfest.postgresql.org/41/
> and rebased the patch on current master.  (I have not reviewed this.)

I have spent some time on that, and here are some comments with an
updated version of the patch attached.

The checks in XLogRegisterData() seemed overcomplicated to me.  In
this context, I think that we should just care about making sure that
mainrdata_len does not overflow depending on the length given by the
caller, which is where pg_add_u32_overflow() becomes handy.

XLogRegisterBufData() added a check on UINT16_MAX in an assert, though
we already check for overflow a couple of lines down.  This is not
necessary, it seems.

@@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecord *rechdr;
char   *scratch = hdr_scratch;
 
+   /* ensure that any assembled record can be decoded */
+   Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));

A hardcoded check like that has no need to be in a code path triggered
each time a WAL record is assembled.  One place where this could be is
InitXLogInsert().  It still means that it is called one time for each
backend, but seeing it where the initialization of xloginsert.c feels
natural, at least.  A postmaster location would be enough, as well.

XLogRecordMaxSize just needs to be checked once IMO, around the end of
XLogRecordAssemble() once we know the total size of the record that
will be fed to a XLogReader.  One thing that we should be more careful
of is to make sure that total_len does not overflow its uint32 value
while assembling the record, as well.

I have removed XLogErrorDataLimitExceeded(), replacing it with more
context about the errors happening.  Perhaps this has no need to be
that much verbose, but it can be really useful for developers.

Some comments had no need to be updated, and there were some typos.

I am on board with the idea of a XLogRecordMaxSize that's bounded at
1020MB, leaving 4MB as room for the extra data needed by a
XLogReader.

At the end, I think that this is quite interesting long-term.  For
example, if we lift up XLogRecordMaxSize, we can evaluate the APIs
adding buffer data or main data separately.

Thoughts about this version?
--
Michael
From 16b40324a5bc5bbe6e06cbfb86251c907f400151 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 28 Mar 2023 20:34:31 +0900
Subject: [PATCH v10] Add protections in xlog record APIs against overflows

Before this, it was possible for an extension to create malicious WAL
records that were too large to replay; or that would overflow the
xl_tot_len field, causing potential corruption in WAL record IO ops.

Emitting invalid records was also possible through
pg_logical_emit_message(), which allowed you to emit arbitrary amounts
of data up to 2GB, much higher than the replay limit of approximately
1GB.

This patch adds a limit to the size of an XLogRecord (1020MiB), checks
against overflows, and ensures that the reader infrastructure can read
any max-sized XLogRecords, such that the wal-generating backend will
fail when it attempts to insert unreadable records, instead of that
insertion succeeding but breaking any replication streams.

Author: Matthias van de Meent 
---
 src/include/access/xlogrecord.h | 11 
 src/backend/access/transam/xloginsert.c | 73 +
 2 files changed, 74 insertions(+), 10 deletions(-)

diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 0d576f7883..0a7b0bb6fc 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -54,6 +54,17 @@ typedef struct XLogRecord
 
 #define SizeOfXLogRecord	(offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
 
+/*
+ * XLogReader needs to allocate all the data of an xlog record in a single
+ * chunk.  This means that a single XLogRecord cannot exceed MaxAllocSize
+ * in length if we ignore any allocation overhead of the XLogReader.
+ *
+ * To accommodate some overhead, this value allows for 4M of allocation
+ * overhead, that should be plenty enough for what
+ * DecodeXLogRecordRequiredSpace() expects as extra.
+ */
+#define XLogRecordMaxSize	(1020 * 1024 * 1024)
+
 /*
  * The high 4 bits in xl_info may be used freely by rmgr. The
  * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 008612e032..86fa6a5276 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -33,6 +33,7 @@
 #include "access/xloginsert.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "common/int.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -351,11 +352,13 @@ void
 XLogRegisterData(char *data, uint32 len)
 {
 	XLogRecData *rdata;
+	uint32		result = 0;
 
 	

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-12-19 Thread Alvaro Herrera
On 2022-Dec-02, Andres Freund wrote:

> Hi,
> 
> On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
> > On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> > > CommitFest 2022-11 is currently underway, so if you are interested
> > > in moving this patch forward, now would be a good time to update it.
> > 
> > No replies after 4 weeks, so I have marked this entry as returned
> > with feedback.  I am still wondering what would be the best thing to
> > do here..
> 
> IMO this a bugfix, I don't think we can just close the entry, even if Matthias
> doesn't have time / energy to push it forward.

I have created one in the January commitfest,
https://commitfest.postgresql.org/41/
and rebased the patch on current master.  (I have not reviewed this.)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La gente vulgar sólo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"
>From 385ccf1b7d68923009c73db493dbe74be34e305b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 19 Dec 2022 12:26:52 +0100
Subject: [PATCH v9] Add protections in xlog record APIs against large numbers
 and overflows.

Before this, it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.

Emitting invalid records was also possible through pg_logical_emit_message(),
which allowed you to emit arbitrary amounts of data up to 2GB, much higher
than the replay limit of approximately 1GB.

This patch adds a limit to the size of an XLogRecord (1020MiB), checks
against overflows, and ensures that the reader infrastructure can read any
max-sized XLogRecords, such that the wal-generating backend will fail when
it attempts to insert unreadable records, instead of that insertion
succeeding but breaking any replication streams.

Author: Matthias van de Meent 
---
 src/backend/access/transam/xloginsert.c | 59 ++---
 src/include/access/xlogrecord.h | 13 ++
 2 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f811523448..c25431f0e9 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -141,6 +141,17 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
 	   bool *topxid_included);
 static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
 	uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog infrastructure.
+ */
+static inline void
+XLogErrorDataLimitExceeded()
+{
+	elog(ERROR, "too much WAL data");
+}
 
 /*
  * Begin constructing a WAL record. This must be called before the
@@ -352,10 +363,25 @@ XLogRegisterData(char *data, uint32 len)
 {
 	XLogRecData *rdata;
 
-	Assert(begininsert_called);
+	Assert(begininsert_called && XLogRecordLengthIsValid(len));
+
+	/*
+	 * Check against max_rdatas; and ensure we don't fill a record with
+	 * more data than can be replayed. Records are allocated in one chunk
+	 * with some overhead, so ensure XLogRecordLengthIsValid() for that
+	 * size of record.
+	 *
+	 * Additionally, check that we don't accidentally overflow the
+	 * intermediate sum value on 32-bit systems by ensuring that the
+	 * sum of the two inputs is no less than one of the inputs.
+	 */
+	if (num_rdatas >= max_rdatas ||
+#if SIZEOF_SIZE_T == 4
+		 mainrdata_len + len < len ||
+#endif
+		!XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len))
+		XLogErrorDataLimitExceeded();
 
-	if (num_rdatas >= max_rdatas)
-		elog(ERROR, "too much WAL data");
 	rdata = [num_rdatas++];
 
 	rdata->data = data;
@@ -391,7 +417,7 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
 	registered_buffer *regbuf;
 	XLogRecData *rdata;
 
-	Assert(begininsert_called);
+	Assert(begininsert_called && len <= UINT16_MAX);
 
 	/* find the registered buffer struct */
 	regbuf = _buffers[block_id];
@@ -400,14 +426,14 @@ XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
 			 block_id);
 
 	/*
-	 * Check against max_rdatas and ensure we do not register more data per
+	 * Check against max_rdatas; and ensure we don't register more data per
 	 * buffer than can be handled by the physical data format; i.e. that
 	 * regbuf->rdata_len does not grow beyond what
 	 * XLogRecordBlockHeader->data_length can hold.
 	 */
 	if (num_rdatas >= max_rdatas ||
 		regbuf->rdata_len + len > UINT16_MAX)
-		elog(ERROR, "too much WAL data");
+		XLogErrorDataLimitExceeded();
 
 	rdata = [num_rdatas++];
 
@@ -527,6 +553,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
    XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
 {
 	XLogRecData *rdt;
+	/*
+	 * Overflow of 

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-12-02 Thread Andres Freund
Hi,

On 2022-07-26 18:58:02 +0200, Matthias van de Meent wrote:
> - updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
> macros (now in xlogrecord.h), with a max length of the somewhat
> arbitrary 1020MiB.
>  This leaves room for approx. 4MiB of per-record allocation overhead
> before you'd hit MaxAllocSize, and also detaches the dependency on
> memutils.h.
> 
> - Retained the check in XLogRegisterData, so that we check against
> integer overflows in the registerdata code instead of only an assert
> in XLogRecordAssemble where it might be too late.
> - Kept the inline static elog-ing function (as per Andres' suggestion
> on 2022-03-14; this decreases binary sizes)

I don't think it should be a static inline. It should to be a *non* inlined
function, so we don't include the code for the elog in the callers.


> +/*
> + * Error due to exceeding the maximum size of a WAL record, or registering
> + * more datas than are being accounted for by the XLog infrastructure.
> + */
> +static inline void
> +XLogErrorDataLimitExceeded()
> +{
> + elog(ERROR, "too much WAL data");
> +}

I think this should be pg_noinline, as mentioned above.


>  /*
>   * Begin constructing a WAL record. This must be called before the
>   * XLogRegister* functions and XLogInsert().
> @@ -348,14 +359,29 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator 
> *rlocator, ForkNumber forknum,
>   * XLogRecGetData().
>   */
>  void
> -XLogRegisterData(char *data, int len)
> +XLogRegisterData(char *data, uint32 len)
>  {
>   XLogRecData *rdata;
>  
> - Assert(begininsert_called);
> + Assert(begininsert_called && XLogRecordLengthIsValid(len));
> +
> + /*
> +  * Check against max_rdatas; and ensure we don't fill a record with
> +  * more data than can be replayed. Records are allocated in one chunk
> +  * with some overhead, so ensure XLogRecordLengthIsValid() for that
> +  * size of record.
> +  *
> +  * Additionally, check that we don't accidentally overflow the
> +  * intermediate sum value on 32-bit systems by ensuring that the
> +  * sum of the two inputs is no less than one of the inputs.
> +  */
> + if (num_rdatas >= max_rdatas ||
> +#if SIZEOF_SIZE_T == 4
> +  mainrdata_len + len < len ||
> +#endif
> + !XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len))
> + XLogErrorDataLimitExceeded();

This is quite a complicated check, and the SIZEOF_SIZE_T == 4 bit is fairly
ugly.

I think we should make mainrdata_len a uint64, then we don't have to worry
about it overflowing on 32bit systems. And TBH, we don't care about some minor
inefficiency on 32bit systems.



> @@ -399,8 +425,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>   elog(ERROR, "no block with id %d registered with WAL insertion",
>block_id);
>  
> - if (num_rdatas >= max_rdatas)
> - elog(ERROR, "too much WAL data");
> + /*
> +  * Check against max_rdatas; and ensure we don't register more data per
> +  * buffer than can be handled by the physical data format; 
> +  * i.e. that regbuf->rdata_len does not grow beyond what
> +  * XLogRecordBlockHeader->data_length can hold.
> +  */
> + if (num_rdatas >= max_rdatas ||
> + regbuf->rdata_len + len > UINT16_MAX)
> + XLogErrorDataLimitExceeded();
> +
>   rdata = [num_rdatas++];
>  
>   rdata->data = data;

This partially has been applied in ffd1b6bb6f8, I think we should consider
adding XLogErrorDataLimitExceeded() separately too.


>   rdt_datas_last->next = regbuf->rdata_head;
> @@ -858,6 +907,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>   for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
>   COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
>  
> + /*
> +  * Ensure that the XLogRecord is not too large.
> +  *
> +  * XLogReader machinery is only able to handle records up to a certain
> +  * size (ignoring machine resource limitations), so make sure we will
> +  * not emit records larger than those sizes we advertise we support.
> +  */
> + if (!XLogRecordLengthIsValid(total_len))
> + XLogErrorDataLimitExceeded();
> +
>   /*
>* Fill in the fields in the record header. Prev-link is filled in 
> later,
>* once we know where in the WAL the record will be inserted. The CRC 
> does

I think this needs to mention that it'll typically cause a PANIC.


Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
> On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> > CommitFest 2022-11 is currently underway, so if you are interested
> > in moving this patch forward, now would be a good time to update it.
> 
> No replies after 4 weeks, so I have marked this entry as returned
> with feedback.  I am still wondering what would be the best thing to
> do here..

IMO this a bugfix, I don't think we can just close the entry, even if Matthias
doesn't have time / energy to push it forward.


I think the big issue with the patch as it stands is that it will typically
cause PANICs on failure, because the record-too-large ERROR be a in a critical
section. That's still better than generating a record that can't be replayed,
but it's not good.

There's not all that many places with potentially huge records. I wonder if we
ought to modify at least the most prominent ones to prepare the record before
the critical section. I think the by far most prominent real-world case is
RecordTransactionCommit(). I think we could rename XactLogCommitRecord() to
XactBuildCommitRecord() build the commit record, then have the caller do
START_CRIT_SECTION(), set DELAY_CHKPT_START, and only then do the
XLogInsert().

That'd even have the nice side-effect of reducing the window in which
DELAY_CHKPT_START is set a bit.

Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-12-01 Thread Michael Paquier
On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> CommitFest 2022-11 is currently underway, so if you are interested
> in moving this patch forward, now would be a good time to update it.

No replies after 4 weeks, so I have marked this entry as returned
with feedback.  I am still wondering what would be the best thing to
do here..
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-11-03 Thread Ian Lawrence Barwick
2022年10月5日(水) 16:46 Michael Paquier :
>
> Hi Matthias,
>
> On Wed, Jul 27, 2022 at 02:07:05PM +0200, Matthias van de Meent wrote:
>
> My apologies for the time it took me to come back to this thread.
> > > + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
> > > s/hhis/this/.
> >
> > Will be included in the next update..
>
> v8 fails to apply.  Could you send a rebased version?
>
> As far as I recall the problems with the block image sizes are solved,
> but we still have a bit more to do in terms of the overall record
> size.  Perhaps there are some parts of the patch you'd like to
> revisit?
>
> For now, I have switched the back as waiting on author, and moved it
> to the next CF.

Hi Matthias

CommitFest 2022-11 is currently underway, so if you are interested
in moving this patch forward, now would be a good time to update it.

Thanks

Ian Barwick




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-10-05 Thread Michael Paquier
Hi Matthias,

On Wed, Jul 27, 2022 at 02:07:05PM +0200, Matthias van de Meent wrote:

My apologies for the time it took me to come back to this thread.
> > + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
> > s/hhis/this/.
> 
> Will be included in the next update..

v8 fails to apply.  Could you send a rebased version?

As far as I recall the problems with the block image sizes are solved,
but we still have a bit more to do in terms of the overall record
size.  Perhaps there are some parts of the patch you'd like to
revisit?

For now, I have switched the back as waiting on author, and moved it
to the next CF.
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-27 Thread Matthias van de Meent
On Wed, 27 Jul 2022 at 11:09, Michael Paquier  wrote:
>
> On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote:
> > - Retained the check in XLogRegisterData, so that we check against
> > integer overflows in the registerdata code instead of only an assert
> > in XLogRecordAssemble where it might be too late.
>
> Why?  The record has not been inserted yet.  I would tend to keep only
> the check at the bottom of XLogRecordAssemble(), for simplicity, and
> call it a day.

Because the sum value main_rdatalen can easily overflow in both the
current and the previous APIs, which then corrupts the WAL - one of
the two issues that I mentioned when I started the thread.

We don't re-summarize the lengths of all XLogRecData segments for the
main record data when assembling a record to keep the performance of
RecordAssemble (probably to limit the complexity when many data
segments are registered), and because I didn't want to add more
changes than necessary this check will need to be done in the place
where the overflow may occur, which is in XLogRegisterData.

> > - Kept the inline static elog-ing function (as per Andres' suggestion
> > on 2022-03-14; this decreases binary sizes)
>
> I am not really convinced that this one is worth doing.

I'm not married to that change, but I also don't see why this can't be
updated while this code is already being touched.

> +#define MaxXLogRecordSize  (1020 * 1024 * 1024)
> +
> +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < 
> MaxXLogRecordSize)
>
> These are used only in xloginsert.c, so we could keep them isolated.

They might be only used in xloginsert right now, but that's not the
point. This is now advertised as part of the record API spec: A record
larger than 1020MB is explicitly not supported. If it was kept
internal to xloginsert, that would be implicit and other people might
start hitting issues similar to those we're hitting right now -
records that are too large to read. Although PostgreSQL is usually the
only one generating WAL, we do support physical replication from
arbitrary PG-compatible WAL streams, which means that any compatible
WAL source could be the origin of our changes - and those need to be
aware of the assumptions we make about the WAL format.

I'm fine with also updating xlogreader.c to check this while reading
records to clarify the limits there as well, if so desired.

> + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
> s/hhis/this/.

Will be included in the next update..

> For now, I have extracted from the patch the two API changes and the
> checks for the block information for uint16, and applied this part.
> That's one less thing to worry about.

Thanks.

Kind regards,

Matthias van de Meent




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-27 Thread Michael Paquier
On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote:
> - Retained the check in XLogRegisterData, so that we check against
> integer overflows in the registerdata code instead of only an assert
> in XLogRecordAssemble where it might be too late.

Why?  The record has not been inserted yet.  I would tend to keep only
the check at the bottom of XLogRecordAssemble(), for simplicity, and
call it a day.

> - Kept the inline static elog-ing function (as per Andres' suggestion
> on 2022-03-14; this decreases binary sizes)

I am not really convinced that this one is worth doing.

+#define MaxXLogRecordSize  (1020 * 1024 * 1024)
+
+#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)

These are used only in xloginsert.c, so we could keep them isolated.

+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
s/hhis/this/.

For now, I have extracted from the patch the two API changes and the
checks for the block information for uint16, and applied this part.
That's one less thing to worry about.
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-26 Thread Matthias van de Meent
On Tue, 26 Jul 2022 at 09:20, Michael Paquier  wrote:
>
> On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
> >  Among the two problems to solve at hand, the parts where the APIs are
> >  changed and made more robust with unsigned types and where block data
> >  is not overflowed with its 16-byte limit are committable, so I'd like
> >  to do that first (still need to check its performance with some micro
> >  benchmark on XLogRegisterBufData()).
> >
> > +1. I'm not excited about adding the "unlikely()" hints, though. We have a
> > pg_attribute_cold hint in ereport(), that should be enough.
>
> Okay, that makes sense.  FWIW, I have been wondering about the
> addition of the extra condition in XLogRegisterBufData() and I did not
> see a difference on HEAD in terms of execution time or profile, with a
> micro-benchmark doing a couple of million calls in a row as of the
> following, roughly:
> [...]

Thanks for testing.

> > How large exactly is the maximum size that this gives? I'd prefer to set the
> > limit conservatively to 1020 MB, for example, with a compile-time static
> > assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).
>
> Something like that would work, I guess.

I've gone over the patch and reviews again, and updated those places
that received comments:

- updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
macros (now in xlogrecord.h), with a max length of the somewhat
arbitrary 1020MiB.
 This leaves room for approx. 4MiB of per-record allocation overhead
before you'd hit MaxAllocSize, and also detaches the dependency on
memutils.h.

- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)
- Dropped any changes in xlogreader.h/c

Kind regards,

Matthias van de Meent


v8-0001-Add-protections-in-xlog-record-APIs-against-large.patch
Description: Binary data


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-26 Thread Michael Paquier
On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
> The way this is written, it would change whenever we add/remove fields in
> DecodedBkpBlock, for example. That's fragile; if you added a field in a
> back-branch, you could accidentally make the new minor version unable to
> read maximum-sized WAL records generated with an older version. I'd like the
> maximum to be more explicit.

That's a good point.

> How large exactly is the maximum size that this gives? I'd prefer to set the
> limit conservatively to 1020 MB, for example, with a compile-time static
> assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).

Something like that would work, I guess.

>  Among the two problems to solve at hand, the parts where the APIs are
>  changed and made more robust with unsigned types and where block data
>  is not overflowed with its 16-byte limit are committable, so I'd like
>  to do that first (still need to check its performance with some micro
>  benchmark on XLogRegisterBufData()).
> 
> +1. I'm not excited about adding the "unlikely()" hints, though. We have a
> pg_attribute_cold hint in ereport(), that should be enough.

Okay, that makes sense.  FWIW, I have been wondering about the
addition of the extra condition in XLogRegisterBufData() and I did not
see a difference on HEAD in terms of execution time or profile, with a
micro-benchmark doing a couple of million calls in a row as of the
following, roughly:
// Can be anything, really..
rel = relation_open(RelationRelationId, AccessShareLock);
buffer = ReadBuffer(rel, 0);
for (i = 0 ; i < WAL_MAX_CALLS ; i++)
{
XLogBeginInsert();
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, buf, 10);
XLogResetInsertion();
}
ReleaseBuffer(buffer);
relation_close(rel, AccessShareLock);
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-25 Thread Matthias van de Meent
On Wed, 13 Jul 2022 at 07:54, Michael Paquier  wrote:
>
> On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
> > Thanks for reviewing.
>
> I think that v6 is over-engineered because there should be no need to
> add a check in xlogreader.c as long as the origin of the problem is
> blocked, no?  And the origin here is when the record is assembled.  At
> least this is the cleanest solution for HEAD, but not in the
> back-branches if we'd care about doing something with records already
> generated, and I am not sure that we need to care about other things
> than HEAD, TBH.

I would prefer it if we would fix the "cannot catch up to primary
because of oversized WAL record" issue in backbranches too. Rather
than failing to recover after failure or breaking replication streams,
I'd rather be unable to write the singular offending WAL record and
break up to one transaction.

> So it seems to me that there is no need to create a
> XLogRecMaxLength which is close to a duplicate of
> DecodeXLogRecordRequiredSpace().
>
> @@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
>
>  {
> XLogRecData *rdt;
> -   uint32  total_len = 0;
> +   uint64  total_len = 0;
> This has no need to change.
>
> My suggestion from upthread was close to what you proposed, but I had
> in mind something simpler, as of:
>
> +   /*
> +* Ensure that xlogreader.c can read the record.
> +*/
> +   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len
> +   elog(ERROR, "too much WAL data");

Huh, yeah, I hadn't thought of that, but that's much simpler indeed.

> This would be the amount of data allocated by the WAL reader when it
> is possible to allocate an oversized record, related to the business
> of the circular buffer depending on if the read is blocking or not.

Yes, I see your point.

> Among the two problems to solve at hand, the parts where the APIs are
> changed and made more robust with unsigned types and where block data
> is not overflowed with its 16-byte limit are committable, so I'd like
> to do that first (still need to check its performance with some micro
> benchmark on XLogRegisterBufData()).

> The second part to block the
> creation of the assembled record is simpler, now
> DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
> though we could inline it in the worst case?

I think that would be better for performance, yes.
DecodeXLogRecordRequiredSpace will already be optimized to just a
single addition by any of `-O[123]`, so keeping this indirection is
quite expensive (relative to the operation being performed).

As for your patch patch:

> +XLogRegisterData(char *data, uint32 len)
> {
> XLogRecData *rdata;
>
> Assert(begininsert_called);
>
> -if (num_rdatas >= max_rdatas)
> +if (unlikely(num_rdatas >= max_rdatas))
> elog(ERROR, "too much WAL data");
> rdata = [num_rdatas++];

XLogRegisterData is designed to be called multiple times for each
record, and this allows the user of the API to overflow the internal
mainrdata_len field if we don't check that the field does not exceed
the maximum record length (or overflow the 32-bit field). As such, I'd
still want a len-check in that function.

I'll send an updated patch by tomorrow.

- Matthias




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-25 Thread Heikki Linnakangas

On 13/07/2022 08:54, Michael Paquier wrote:

I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no?  And the origin here is when the record is assembled.  At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH.  So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().

@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)

  {
 XLogRecData *rdt;
-   uint32  total_len = 0;
+   uint64  total_len = 0;
This has no need to change.

My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:

+   /*
+* Ensure that xlogreader.c can read the record.
+*/
+   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len
+   elog(ERROR, "too much WAL data");

This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer depending on if the read is blocking or not.


The way this is written, it would change whenever we add/remove fields 
in DecodedBkpBlock, for example. That's fragile; if you added a field in 
a back-branch, you could accidentally make the new minor version unable 
to read maximum-sized WAL records generated with an older version. I'd 
like the maximum to be more explicit.


How large exactly is the maximum size that this gives? I'd prefer to set 
the limit conservatively to 1020 MB, for example, with a compile-time 
static assertion that 
AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).



Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).


+1. I'm not excited about adding the "unlikely()" hints, though. We have 
a pg_attribute_cold hint in ereport(), that should be enough.


- Heikki




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-15 Thread Andres Freund
Hi,

On 2022-07-15 11:25:54 +0200, Matthias van de Meent wrote:
> On Mon, 14 Mar 2022 at 18:14, Andres Freund  wrote:
> >
> > A random thought I had while thinking about the size limits: We could use 
> > the
> > low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> > XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
> > allow us to e.g. get away from needing Heap2. Which would aestethically be
> > pleasing.
> 
> I just remembered your comment while going through the xlog code and
> thought this about the same issue: We still have 2 bytes of padding in
> XLogRecord, between xl_rmid and xl_crc. Can't we instead use that
> space for rmgr-specific flags, as opposed to stealing bits from
> xl_info?

Sounds like a good idea to me. I'm not sure who is stealing bits from what
right now, but it clearly seems worthwhile to separate "flags" from "record
type within rmgr".

I think we should split it at least into three things:

1) generic per-record flags for xlog machinery (ie. XLR_SPECIAL_REL_UPDATE, 
XLR_CHECK_CONSISTENCY)
2) rmgr record type identifier (e.g. XLOG_HEAP_*)
2) rmgr specific flags (e.g. XLOG_HEAP_INIT_PAGE)

Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-15 Thread Matthias van de Meent
On Mon, 14 Mar 2022 at 18:14, Andres Freund  wrote:
>
> A random thought I had while thinking about the size limits: We could use the
> low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
> allow us to e.g. get away from needing Heap2. Which would aestethically be
> pleasing.

I just remembered your comment while going through the xlog code and
thought this about the same issue: We still have 2 bytes of padding in
XLogRecord, between xl_rmid and xl_crc. Can't we instead use that
space for rmgr-specific flags, as opposed to stealing bits from
xl_info?


Kind regards,

Matthias van de Meent




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-12 Thread Michael Paquier
On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
> Thanks for reviewing.

I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no?  And the origin here is when the record is assembled.  At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH.  So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().

@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)

 {
XLogRecData *rdt;
-   uint32  total_len = 0;
+   uint64  total_len = 0;
This has no need to change.

My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:

+   /*
+* Ensure that xlogreader.c can read the record.
+*/
+   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len
+   elog(ERROR, "too much WAL data");

This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer depending on if the read is blocking or not.

Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-byte limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).  The second part to block the
creation of the assembled record is simpler, now
DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
though we could inline it in the worst case?
--
Michael
From 2031f15fe90c94deb21d4e41b717ada9a8bdb520 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v7] Add protections in xlog record APIs against large numbers
 and overflows.

Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed you to emit
arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB.
---
 src/include/access/xloginsert.h |  4 ++--
 src/backend/access/transam/xloginsert.c | 30 -
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index c04f77b173..aed4643d1c 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
 extern void XLogSetRecordFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
 extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
 extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
 			  ForkNumber forknum, BlockNumber blknum, char *page,
 			  uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
 extern void XLogResetInsertion(void);
 extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f3c29fa909..3788429b70 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -348,13 +348,13 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
  * XLogRecGetData().
  */
 void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
 {
 	XLogRecData *rdata;
 
 	Assert(begininsert_called);
 
-	if (num_rdatas >= max_rdatas)
+	if (unlikely(num_rdatas >= max_rdatas))
 		elog(ERROR, "too much WAL data");
 	rdata = [num_rdatas++];
 
@@ -386,7 +386,7 @@ XLogRegisterData(char *data, int len)
  * limited)
  */
 void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
 {
 	registered_buffer *regbuf;
 	XLogRecData *rdata;
@@ -399,8 +399,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 		elog(ERROR, "no block with id %d registered with WAL insertion",
 			 block_id);
 
-	if (num_rdatas >= max_rdatas)
+	/*
+	 * Check against max_rdatas and ensure we do not register more data per
+	 * buffer than can be handled by the physical data format; i.e. that
+	 * regbuf->rdata_len 

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-11 Thread Matthias van de Meent
On Fri, 8 Jul 2022 at 21:35, David Zhang  wrote:
>
> Hi,
>
> I tried to apply this patch v5 to current master branch but it complains,
> "git apply --check
> v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
> error: patch failed: src/include/access/xloginsert.h:43
> error: src/include/access/xloginsert.h: patch does not apply"
>
> then I checked it out before the commit
> `b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.

The attached rebased patchset should work with master @ 2cd2569c and
REL_15_STABLE @ 53df1e28. I've also added a patch that works for PG14
and earlier, which should be correct for all versions that include
commit 2c03216d (that is, all versions back to 9.5).

> 1) both make check and make installcheck passed.
>
> 2) and I can also see this patch v5 prevents the error happens previously,
>
> "postgres=# SELECT pg_logical_emit_message(false, long, long) FROM
> repeat(repeat(' ', 1024), 1024*1023) as l(long);
> ERROR:  too much WAL data"
>
> 3) without this v5 patch, the same test will cause the standby crash
> like below, and the standby not be able to boot up after this crash.

Thanks for reviewing.

Kind regards,

Matthias van de Meent


v6-0001-Add-protections-in-xlog-record-APIs-against-large.patch
Description: Binary data


v6-0001-Add-protections-in-xlog-record-APIs-against-large.v15-patch
Description: Binary data


v6-0001-Add-protections-in-xlog-record-APIs-against-large.v14-patch
Description: Binary data


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-08 Thread David Zhang

Hi,

I tried to apply this patch v5 to current master branch but it complains,
"git apply --check 
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch

error: patch failed: src/include/access/xloginsert.h:43
error: src/include/access/xloginsert.h: patch does not apply"

then I checked it out before the commit 
`b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.


1) both make check and make installcheck passed.

2) and I can also see this patch v5 prevents the error happens previously,

"postgres=# SELECT pg_logical_emit_message(false, long, long) FROM 
repeat(repeat(' ', 1024), 1024*1023) as l(long);

ERROR:  too much WAL data"

3) without this v5 patch, the same test will cause the standby crash 
like below, and the standby not be able to boot up after this crash.


"2022-07-08 12:28:16.425 PDT [2363] FATAL:  invalid memory alloc request 
size 2145388995
2022-07-08 12:28:16.426 PDT [2360] LOG:  startup process (PID 2363) 
exited with exit code 1
2022-07-08 12:28:16.426 PDT [2360] LOG:  terminating any other active 
server processes
2022-07-08 12:28:16.427 PDT [2360] LOG:  shutting down due to startup 
process failure

2022-07-08 12:28:16.428 PDT [2360] LOG:  database system is shut down"


Best regards,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-01 Thread Matthias van de Meent
On Tue, 21 Jun 2022 at 03:45, Michael Paquier  wrote:
> +   /*
> +* Ensure that xlogreader.c can read the record by ensuring that the
> +* data section of the WAL record can be allocated.
> +*/
> +   if (unlikely(!AllocSizeIsValid(total_len)))
> +   XLogErrorDataLimitExceeded();
>
> By the way, while skimming through the patch, the WAL reader seems to
> be a bit more pessimistic than this estimation, calculating the amount
> to allocate as of DecodeXLogRecordRequiredSpace(), based on the
> xl_tot_len given by a record.

I see, thanks for notifying me about that.

PFA a correction for that issue. It does copy over the value for
MaxAllocSize from memutils.h into xlogreader.h, because we need that
value in FRONTEND builds too, and memutils.h can't be included in
FRONTEND builds. One file suffixed with .backpatch that doesn't
include the function signature changes, but it is not optimized for
any stable branch[15].

-Matthias

PS. I'm not amused by the double copy we do in the xlogreader, as I
had expected we'd just read the record and point into that single
xl_rec_len-sized buffer. Apparently that's not how it works...

[15] it should apply to stable branches all the way back to
REL_15_STABLE and still work as expected. Any older than that I
haven't tested, but probably only require some updates for
XLogRecMaxLength in xlogreader.h.


v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
Description: Binary data


v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch.backpatch
Description: Binary data


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-20 Thread Michael Paquier
On Mon, Jun 20, 2022 at 11:01:51AM +0200, Matthias van de Meent wrote:
> On Mon, 20 Jun 2022 at 07:02, Michael Paquier  wrote:
>> +   if (unlikely(!AllocSizeIsValid(total_len)))
>> +   XLogErrorDataLimitExceeded();
>> Rather than a single check at the end of XLogRecordAssemble(), you'd
>> better look after that each time total_len is added up?
>
> I was doing so previously, but there were some good arguments against that:
> 
> - Performance of XLogRecordAssemble should be impacted as little as
> possible. XLogRecordAssemble is in many hot paths, and it is highly
> unlikely this check will be hit, because nobody else has previously
> reported this issue. Any check, however unlikely, will add some
> overhead, so removing check counts reduces overhead of this patch.

Some macro-benchmarking could be in place here, and this would most
likely become noticeable when assembling a bunch of little records?

> - The user or system is unlikely to care about which specific check
> was hit, and only needs to care _that_ the check was hit. An attached
> debugger will be able to debug the internals of the xlog machinery and
> find out the specific reasons for the error, but I see no specific
> reason why the specific reason would need to be reported to the
> connection.

Okay.

+   /*
+* Ensure that xlogreader.c can read the record by ensuring that the
+* data section of the WAL record can be allocated.
+*/
+   if (unlikely(!AllocSizeIsValid(total_len)))
+   XLogErrorDataLimitExceeded();

By the way, while skimming through the patch, the WAL reader seems to
be a bit more pessimistic than this estimation, calculating the amount
to allocate as of DecodeXLogRecordRequiredSpace(), based on the
xl_tot_len given by a record.
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-20 Thread Matthias van de Meent
On Mon, 20 Jun 2022 at 07:02, Michael Paquier  wrote:
>
> On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:
> > Did you initiate a new cluster or otherwise skip the invalid record
> > you generated when running the instance based on master? It seems to
> > me you're trying to replay the invalid record (len > MaxAllocSize),
> > and this patch does not try to fix that issue. This patch just tries
> > to forbid emitting records larger than MaxAllocSize, as per the check
> > in XLogRecordAssemble, so that we wont emit unreadable records into
> > the WAL anymore.
> >
> > Reading unreadable records still won't be possible, but that's also
> > not something I'm trying to fix.
>
> As long as you cannot generate such WAL records that should be fine as
> wAL is not reused across upgrades, so this kind of restriction is a
> no-brainer on HEAD.  The back-patching argument is not on the table
> anyway, as some of the routine signatures change with the unsigned
> arguments, because of those safety checks.

The signature change is mostly ornamental, see attached v4.backpatch.
The main reason for changing the signature is to make sure nobody can
provide a negative value, but it's not important to the patch.

>
> +   if (unlikely(num_rdatas >= max_rdatas) ||
> +   unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
> +   XLogErrorDataLimitExceeded();
> [...]
> +inline void
> +XLogErrorDataLimitExceeded()
> +{
> +   elog(ERROR, "too much WAL data");
> +}
> The three checks are different, OK..

They each check slightly different things, but with the same error. In
RegisterData, it checks that the data can still be allocated and does
not overflow the register, in RegisterBlock it checks that the total
length of data registered to the block does not exceed the max value
of XLogRecordBlockHeader->data_length. I've updated the comments above
the checks so that this distinction is more clear.

> Note that static is missing.

Fixed in attached v4.patch

> +   if (unlikely(!AllocSizeIsValid(total_len)))
> +   XLogErrorDataLimitExceeded();
> Rather than a single check at the end of XLogRecordAssemble(), you'd
> better look after that each time total_len is added up?

I was doing so previously, but there were some good arguments against that:

- Performance of XLogRecordAssemble should be impacted as little as
possible. XLogRecordAssemble is in many hot paths, and it is highly
unlikely this check will be hit, because nobody else has previously
reported this issue. Any check, however unlikely, will add some
overhead, so removing check counts reduces overhead of this patch.

- The user or system is unlikely to care about which specific check
was hit, and only needs to care _that_ the check was hit. An attached
debugger will be able to debug the internals of the xlog machinery and
find out the specific reasons for the error, but I see no specific
reason why the specific reason would need to be reported to the
connection.

Kind regards,

Matthias van de Meent


v4-0001-Add-protections-in-xlog-record-APIs-against-large.patch
Description: Binary data


v4-0001-Add-protections-in-xlog-record-APIs-against-large.backpatch
Description: Binary data


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-19 Thread Michael Paquier
On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:
> Did you initiate a new cluster or otherwise skip the invalid record
> you generated when running the instance based on master? It seems to
> me you're trying to replay the invalid record (len > MaxAllocSize),
> and this patch does not try to fix that issue. This patch just tries
> to forbid emitting records larger than MaxAllocSize, as per the check
> in XLogRecordAssemble, so that we wont emit unreadable records into
> the WAL anymore.
> 
> Reading unreadable records still won't be possible, but that's also
> not something I'm trying to fix.

As long as you cannot generate such WAL records that should be fine as
wAL is not reused across upgrades, so this kind of restriction is a
no-brainer on HEAD.  The back-patching argument is not on the table
anyway, as some of the routine signatures change with the unsigned
arguments, because of those safety checks.

+   if (unlikely(num_rdatas >= max_rdatas) ||
+   unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+   XLogErrorDataLimitExceeded();
[...]
+inline void
+XLogErrorDataLimitExceeded()
+{
+   elog(ERROR, "too much WAL data");
+}
The three checks are different, OK..  Note that static is missing.

+   if (unlikely(!AllocSizeIsValid(total_len)))
+   XLogErrorDataLimitExceeded();
Rather than a single check at the end of XLogRecordAssemble(), you'd
better look after that each time total_len is added up?
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-13 Thread David Zhang




A little confused here, does this patch V3 intend to solve this problem "record 
length 2145386550 at 0/360 too long"?

No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.


I set up a simple Primary and Standby stream replication environment, and use 
the above query to run the test for before and after patch v3. The error 
message still exist, but with different message.

Before patch v3, the error is showing below,

2022-06-10 15:32:25.307 PDT [4253] LOG:  record length 2145386550 at 0/360 
too long
2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver process due 
to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 0/360 
too long

After patch v3, the error displays differently

2022-06-10 15:53:53.397 PDT [12848] LOG:  record length 2145386550 at 0/360 
too long
2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from WAL 
stream: ERROR:  requested WAL segment 00010045 has already been 
removed
2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 0/360 
too long

And once the error happens, then the Standby can't continue the replication.

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we wont emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.


Thanks a lot for the clarification. My testing environment is pretty 
simple, initdb for Primary, run basebackup and set the connection string 
for Standby, then run the "pg_logical_emit_message" query and tail the 
log on standby side.


Best regards,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-11 Thread Matthias van de Meent
On Sat, 11 Jun 2022 at 01:32, David Zhang  wrote:
>
> Hi,
>
> > > MaxAllocSize is pretty easy:
> > > SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 
> > > 1024), 1024*1023) as l(long);
> > >
> > > on a standby:
> > >
> > > 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 
> > > 2145386550 at 0/360 too long
> >
> > Thanks for the reference. I was already playing around with 2PC log
> > records (which can theoretically contain >4GB of data); but your
> > example is much easier and takes significantly less time.
>
> A little confused here, does this patch V3 intend to solve this problem 
> "record length 2145386550 at 0/360 too long"?

No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.

> I set up a simple Primary and Standby stream replication environment, and use 
> the above query to run the test for before and after patch v3. The error 
> message still exist, but with different message.
>
> Before patch v3, the error is showing below,
>
> 2022-06-10 15:32:25.307 PDT [4253] LOG:  record length 2145386550 at 
> 0/360 too long
> 2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver process 
> due to administrator command
> 2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 
> 0/360 too long
>
> After patch v3, the error displays differently
>
> 2022-06-10 15:53:53.397 PDT [12848] LOG:  record length 2145386550 at 
> 0/360 too long
> 2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from WAL 
> stream: ERROR:  requested WAL segment 00010045 has already 
> been removed
> 2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 
> 0/360 too long
>
> And once the error happens, then the Standby can't continue the replication.

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we wont emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.

> Is a particular reason to say "more datas" at line 52 in patch v3?
>
> + * more datas than are being accounted for by the XLog infrastructure.

Yes. This error is thrown when you try to register a 34th block, or an
Nth rdata where the caller previously only reserved n - 1 data slots.
As such 'datas', for the num_rdatas and max_rdatas variables.

Thanks for looking at the patch.

- Matthias




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-10 Thread David Zhang

Hi,

> > MaxAllocSize is pretty easy:
> > SELECT pg_logical_emit_message(false, long, long) FROM 
repeat(repeat(' ', 1024), 1024*1023) as l(long);

> >
> > on a standby:
> >
> > 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record 
length 2145386550 at 0/360 too long

>
> Thanks for the reference. I was already playing around with 2PC log
> records (which can theoretically contain >4GB of data); but your
> example is much easier and takes significantly less time.

A little confused here, does this patch V3 intend to solve this problem 
"record length 2145386550 at 0/360 too long"?


I set up a simple Primary and Standby stream replication environment, 
and use the above query to run the test for before and after patch v3. 
The error message still exist, but with different message.


Before patch v3, the error is showing below,

2022-06-10 15:32:25.307 PDT [4253] LOG: record length 2145386550 at 
0/360 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver 
process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 
0/360 too long


After patch v3, the error displays differently

2022-06-10 15:53:53.397 PDT [12848] LOG: record length 2145386550 at 
0/360 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from 
WAL stream: ERROR:  requested WAL segment 00010045 has 
already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 
0/360 too long


And once the error happens, then the Standby can't continue the replication.


Is a particular reason to say "more datas" at line 52 in patch v3?

+ * more datas than are being accounted for by the XLog infrastructure.


On 2022-04-18 10:19 p.m., Michael Paquier wrote:

On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:

Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?

The next minor release is three weeks away, so now would be a good
time to get that addressed.  Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael


Thank you,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-04-18 Thread Michael Paquier
On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:
> Seeing that the busiest time for PG15 - the last commitfest before the
> feature freeze - has passed, could someone take another look at this?

The next minor release is three weeks away, so now would be a good
time to get that addressed.  Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael


signature.asc
Description: PGP signature


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-04-18 Thread Matthias van de Meent
Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?

The changes that were requested by Heikki and Andres have been merged
into patch v3, and I think it would be nice to fix this security issue
in the upcoming minor release(s).


-Matthias




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-15 Thread Matthias van de Meent
Apart from registering this in CF 2022-07 last Friday, I've also just
added this issue to the Open Items list for PG15 under "Older bugs
affecting stable branches"; as a precaution to not lose track of this
issue in the buzz of the upcoming feature freeze.

-Matthias




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-15 Thread Matthias van de Meent
On Mon, 14 Mar 2022 at 18:14, Andres Freund  wrote:
>
> Hi
>
> A random thought I had while thinking about the size limits: We could use the
> low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
> allow us to e.g. get away from needing Heap2. Which would aestethically be
> pleasing.

That would be interesting; though out of scope for this bug I'm trying to fix.

> On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
> > I'm not sure whether or not to include this in the test suite, though,
> > as this would require a machine with at least 1GB of memory available
> > for this test alone, and I don't know the current requirements for
> > running the test suite.
>
> We definitely shouldn't require this much RAM for the tests.
>
> It might be worth adding tests exercising edge cases around segment boundaries
> (and perhaps page boundaries) though. E.g. record headers split across pages
> and segments.
>
>
>
> > --- a/src/backend/access/transam/xloginsert.c
> > +++ b/src/backend/access/transam/xloginsert.c
> > @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
> >  {
> >   XLogRecData *rdata;
> >
> > - Assert(begininsert_called);
> > + Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));
>
> Shouldn't we just make the length argument unsigned?

I've applied that in the attached revision; but I'd like to note that
this makes the fix less straightforward to backpatch; as the changes
to the public function signatures shouldn't be applied in older
versions.

> > - if (num_rdatas >= max_rdatas)
> > + /*
> > +  * Check against max_rdatas; and ensure we don't fill a record with
> > +  * more data than can be replayed
> > +  */
> > + if (unlikely(num_rdatas >= max_rdatas ||
> > +  !AllocSizeIsValid((uint64) mainrdata_len + 
> > (uint64) len)))
> >   elog(ERROR, "too much WAL data");
> > +
> >   rdata = [num_rdatas++];
>
> Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
> but I doubt if it makes an actual difference to the compiler.

Agreed, updated.

> >   rdata->data = data;
> > @@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
> >   registered_buffer *regbuf;
> >   XLogRecData *rdata;
> >
> > - Assert(begininsert_called);
> > + Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
> >
> >   /* find the registered buffer struct */
> >   regbuf = _buffers[block_id];
> > @@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int 
> > len)
> >   elog(ERROR, "no block with id %d registered with WAL 
> > insertion",
> >block_id);
> >
> > - if (num_rdatas >= max_rdatas)
> > + /*
> > +  * Check against max_rdatas; and ensure we don't register more data 
> > per
> > +  * buffer than can be handled by the physical record format.
> > +  */
> > + if (unlikely(num_rdatas >= max_rdatas ||
> > +  regbuf->rdata_len + len > UINT16_MAX))
> >   elog(ERROR, "too much WAL data");
> > +
> >   rdata = [num_rdatas++];
>
> Given the repeated check it might be worth to just put it in a static inline
> used from the relevant places (which'd generate less code because the same
> line number would be used for all the checks).

The check itself is slightly different in those 3 places; but the
error message is shared. Do you mean to extract the elog() into a
static inline function (as attached), or did I misunderstand?


-Matthias
From 9e4b70ca096469616fd1320a02cbde129a4943b6 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v3] Add protections in xlog record APIs against large numbers
 and overflows.

Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
 src/backend/access/transam/xloginsert.c | 56 -
 src/include/access/xloginsert.h |  4 +-
 src/include/access/xlogrecord.h |  4 ++
 3 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..e87642ccae 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -127,6 +127,17 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
 	   bool *topxid_included);
 static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
 	uint16 hole_length, char *dest, uint16 *dlen);
+static inline void XLogErrorDataLimitExceeded();
+
+/*
+ * Error due to exceeding the maximum size of a WAL record, or registering
+ * more datas than are being accounted for by the XLog 

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-14 Thread Andres Freund
Hi

A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aestethically be
pleasing.



On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
> I'm not sure whether or not to include this in the test suite, though,
> as this would require a machine with at least 1GB of memory available
> for this test alone, and I don't know the current requirements for
> running the test suite.

We definitely shouldn't require this much RAM for the tests.

It might be worth adding tests exercising edge cases around segment boundaries
(and perhaps page boundaries) though. E.g. record headers split across pages
and segments.



> --- a/src/backend/access/transam/xloginsert.c
> +++ b/src/backend/access/transam/xloginsert.c
> @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
>  {
>   XLogRecData *rdata;
>  
> - Assert(begininsert_called);
> + Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));

Shouldn't we just make the length argument unsigned?


> - if (num_rdatas >= max_rdatas)
> + /*
> +  * Check against max_rdatas; and ensure we don't fill a record with
> +  * more data than can be replayed
> +  */
> + if (unlikely(num_rdatas >= max_rdatas ||
> +  !AllocSizeIsValid((uint64) mainrdata_len + 
> (uint64) len)))
>   elog(ERROR, "too much WAL data");
> +
>   rdata = [num_rdatas++];

Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
but I doubt if it makes an actual difference to the compiler.


>   rdata->data = data;
> @@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>   registered_buffer *regbuf;
>   XLogRecData *rdata;
>  
> - Assert(begininsert_called);
> + Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
>  
>   /* find the registered buffer struct */
>   regbuf = _buffers[block_id];
> @@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>   elog(ERROR, "no block with id %d registered with WAL insertion",
>block_id);
>  
> - if (num_rdatas >= max_rdatas)
> + /*
> +  * Check against max_rdatas; and ensure we don't register more data per
> +  * buffer than can be handled by the physical record format.
> +  */
> + if (unlikely(num_rdatas >= max_rdatas ||
> +  regbuf->rdata_len + len > UINT16_MAX))
>   elog(ERROR, "too much WAL data");
> +
>   rdata = [num_rdatas++];

Given the repeated check it might be worth to just put it in a static inline
used from the relevant places (which'd generate less code because the same
line number would be used for all the checks).

Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-14 Thread Matthias van de Meent
Thank you all for the feedback. Please find attached v2 of the
patchset, which contains updated comments and applies the suggested
changes.

On Sat, 12 Mar 2022 at 02:03, Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
> > Have you been able to create a test case for that? The largest record I can
> > think of is a commit record with a huge number of subtransactions, dropped
> > relations, and shared inval messages. I'm not sure if you can overflow a
> > uint32 with that, but exceeding MaxAllocSize seems possible.
>
> MaxAllocSize is pretty easy:
> SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 
> 1024), 1024*1023) as l(long);
>
> on a standby:
>
> 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 
> 2145386550 at 0/360 too long

Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data); but your
example is much easier and takes significantly less time.

I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.

> > I wonder if these checks hurt performance. These are very cheap, but then
> > again, this codepath is very hot. It's probably fine, but it still worries
> > me a little. Maybe some of these could be Asserts.
>
> I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
> it should be statically predicted to be not taken with the unlikely. I don't
> think it's quite inner-loop enough for the instructions or the number of
> "concurrently out of order branches" to be a problem.
>
> FWIW, often the added elog()s are worse, because they require a decent amount
> of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
> variables etc). They can't even be deduplicated because of the line-numbers
> embedded.
>
> So maybe just collapse the new elog() with the previous elog, with a common
> unlikely()?

Updated.

> > > @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
> > > if (needs_data)
> > > {
> > > +   /* protect against overflow */
> > > +   if (unlikely(regbuf->rdata_len > UINT16_MAX))
> > > +   elog(ERROR, "too much WAL data for registered 
> > > buffer");
> > > +
> > > /*
> > >  * Link the caller-supplied rdata chain for this 
> > > buffer to the
> > >  * overall list.
>
> FWIW, this branch I'm a tad more concerned about - it's in a loop body where
> plausibly a lot of branches could be outstanding at the same time.
>
> ISTM that this could just be an assert?

This specific location has been replaced with an Assert, while
XLogRegisterBufData always does the unlikely()-ed bounds check.

Kind regards,

Matthias


v2-0001-Add-protections-in-xlog-record-APIs-against-large.patch
Description: Binary data


Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Andres Freund
Hi,

On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
> Have you been able to create a test case for that? The largest record I can
> think of is a commit record with a huge number of subtransactions, dropped
> relations, and shared inval messages. I'm not sure if you can overflow a
> uint32 with that, but exceeding MaxAllocSize seems possible.

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 
1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 
2145386550 at 0/360 too long



> I wonder if these checks hurt performance. These are very cheap, but then
> again, this codepath is very hot. It's probably fine, but it still worries
> me a little. Maybe some of these could be Asserts.

I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.

FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.

So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?


> > @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
> > if (needs_data)
> > {
> > +   /* protect against overflow */
> > +   if (unlikely(regbuf->rdata_len > UINT16_MAX))
> > +   elog(ERROR, "too much WAL data for registered 
> > buffer");
> > +
> > /*
> >  * Link the caller-supplied rdata chain for this buffer 
> > to the
> >  * overall list.

FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.

ISTM that this could just be an assert?

Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Robert Haas
On Fri, Mar 11, 2022 at 3:42 PM Heikki Linnakangas  wrote:
> Have you been able to create a test case for that? The largest record I
> can think of is a commit record with a huge number of subtransactions,
> dropped relations, and shared inval messages. I'm not sure if you can
> overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

I believe that wal_level=logical can generate very large update and
delete records, especially with REPLICA IDENTITY FULL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Heikki Linnakangas

On 11/03/2022 17:42, Matthias van de Meent wrote:

Hi,

Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize; but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that postgresql cannot replay.


Oops, that would be nasty.


Similarly; it is possible to repeatedly call XlogRegisterData() so as
to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
writes while processing record data;


And that too.

Have you been able to create a test case for that? The largest record I 
can think of is a commit record with a huge number of subtransactions, 
dropped relations, and shared inval messages. I'm not sure if you can 
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.



PFA a patch that attempts to fix both of these issues in the insertion
API; by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B; and why that limit exists.
diff --git a/src/backend/access/transam/xloginsert.c 
b/src/backend/access/transam/xloginsert.c
index c260310c4c..ae654177de 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
 
 	if (num_rdatas >= max_rdatas)

elog(ERROR, "too much WAL data");
+
+   /* protect against overflow */
+   if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
+   elog(ERROR, "too much WAL data");
+
rdata = [num_rdatas++];
 
 	rdata->data = data;


Could check for just AllocSizeValid(mainrdata_len), if we're only 
worried about the total size of the data to exceed the limit, and assume 
that each individual piece of data is smaller.


We also don't check for negative 'len'. I think that's fine, the caller 
bears some responsibility for passing valid arguments too. But maybe 
uint32 or size_t would be more appropriate here.


I wonder if these checks hurt performance. These are very cheap, but 
then again, this codepath is very hot. It's probably fine, but it still 
worries me a little. Maybe some of these could be Asserts.



@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 
 	if (num_rdatas >= max_rdatas)

elog(ERROR, "too much WAL data");
+
+   /* protect against overflow */
+   if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
+   elog(ERROR, "too much WAL data");
+
rdata = [num_rdatas++];
 
 	rdata->data = data;


Could check "len > UINT16_MAX". As you noted in XLogRecordAssemble, 
that's the real limit. And if you check for that here, you don't need to 
check it in XLogRecordAssemble.



@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
   XLogRecPtr *fpw_lsn, int *num_fpi, bool 
*topxid_included)
 {
XLogRecData *rdt;
-   uint32  total_len = 0;
+   uint64  total_len = 0;
int block_id;
pg_crc32c   rdata_crc;
registered_buffer *prev_regbuf = NULL;


I don't think the change to uint64 is necessary. If all the data blocks 
are limited to 64 kB, and the number of blocks is limited, and the 
number of blocks is limited too.



@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 		if (needs_data)

{
+   /* protect against overflow */
+   if (unlikely(regbuf->rdata_len > UINT16_MAX))
+   elog(ERROR, "too much WAL data for registered 
buffer");
+
/*
 * Link the caller-supplied rdata chain for this buffer 
to the
 * overall list.
@@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
 
+	/*

+* Ensure that xlogreader.c can read the record; and check that we don't
+* accidentally overflow the size of the record.
+* */
+   if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
+   elog(ERROR, "too much registered data for WAL record");
+
/*
 * Fill in the fields in the record header. Prev-link is filled in 
later,
 * once we know where in the WAL the record will be inserted. The CRC 
does


It's enough to check AllocSizeIsValid(total_len), no need to also check 
against UINT32_MAX.


- Heikki




Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-03-11 Thread Matthias van de Meent
Hi,

Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize; but this is not currently enforced in the
XLogRecAssemble API. This means it is possible to assemble a record
that postgresql cannot replay.
Similarly; it is possible to repeatedly call XlogRegisterData() so as
to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
writes while processing record data;

PFA a patch that attempts to fix both of these issues in the insertion
API; by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B; and why that limit exists.

Kind regards,

Matthias van de Meent
From a72121b8fccbbf8c289e71a2c76801a1004f5353 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 11 Mar 2022 16:16:22 +0100
Subject: [PATCH v1] Add protections in xlog record APIs against large numbers
 and overflows.

Before this; it was possible for an extension to create malicious WAL records
that were too large to replay; or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops.
---
 src/backend/access/transam/xloginsert.c | 23 ++-
 src/include/access/xlogrecord.h |  4 
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..ae654177de 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
 
 	if (num_rdatas >= max_rdatas)
 		elog(ERROR, "too much WAL data");
+
+	/* protect against overflow */
+	if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
+		elog(ERROR, "too much WAL data");
+
 	rdata = [num_rdatas++];
 
 	rdata->data = data;
@@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 
 	if (num_rdatas >= max_rdatas)
 		elog(ERROR, "too much WAL data");
+
+	/* protect against overflow */
+	if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
+		elog(ERROR, "too much WAL data");
+
 	rdata = [num_rdatas++];
 
 	rdata->data = data;
@@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
    XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
 {
 	XLogRecData *rdt;
-	uint32		total_len = 0;
+	uint64		total_len = 0;
 	int			block_id;
 	pg_crc32c	rdata_crc;
 	registered_buffer *prev_regbuf = NULL;
@@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 
 		if (needs_data)
 		{
+			/* protect against overflow */
+			if (unlikely(regbuf->rdata_len > UINT16_MAX))
+elog(ERROR, "too much WAL data for registered buffer");
+
 			/*
 			 * Link the caller-supplied rdata chain for this buffer to the
 			 * overall list.
@@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
 		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
 
+	/*
+	 * Ensure that xlogreader.c can read the record; and check that we don't
+	 * accidentally overflow the size of the record.
+	 * */
+	if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
+		elog(ERROR, "too much registered data for WAL record");
+
 	/*
 	 * Fill in the fields in the record header. Prev-link is filled in later,
 	 * once we know where in the WAL the record will be inserted. The CRC does
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index c1b1137aa7..950e1f22b0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -37,6 +37,10 @@
  * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
  * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
  * used to distinguish between block references, and the main data structs.
+ *
+ * Note that xlogreader.c limits the total size of the xl_tot_len to
+ * MaxAllocSize (1GB - 1). This means that this is also the maximum size
+ * of a single WAL record - we would be unable to replay any larger record.
  */
 typedef struct XLogRecord
 {
-- 
2.30.2