Re: Fail hard if xlogreader.c fails on out-of-memory
On Thu, Sep 28, 2023 at 09:36:37AM +0900, Michael Paquier wrote:
> If none, I propose to apply the patch to switch to palloc() instead of
> palloc_extended(NO_OOM) in this code around the beginning of next
> week, down to 12.

Done down to 12 as of 6b18b3fe2c2f, then.
--
Michael
Re: Fail hard if xlogreader.c fails on out-of-memory
On Tue, Sep 26, 2023 at 06:28:30PM -0700, Noah Misch wrote:
> On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
>> What Michael wants to do now is remove the 2004-era assumption that
>> malloc failure implies bogus data.  It must be pretty unlikely in a 64
>> bit world with overcommitted virtual memory, but a legitimate
>> xl_tot_len can falsely end recovery and lose data, as reported from a
>> production case analysed by his colleagues.  In other words, we can
>> actually distinguish between lack of resources and recycled bogus
>> data, so why treat them the same?
>
> Indeed.  Hard failure is fine, and ENOMEM=end-of-WAL definitely isn't.

Are there any more comments and/or suggestions here?

If none, I propose to apply the patch to switch to palloc() instead of
palloc_extended(NO_OOM) in this code around the beginning of next week,
down to 12.
--
Michael
Re: Fail hard if xlogreader.c fails on out-of-memory
On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
> On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier wrote:
>> Thoughts and/or comments are welcome.
>
> I don't have an opinion yet on your other thread about making this
> stuff configurable for replicas, but for the simple crash recovery
> case shown here, hard failure makes sense to me.
>
> Recycled pages can't fool us into making a huge allocation any more.
> If xl_tot_len implies more than one page but the next page's
> xlp_pageaddr is too low, then either the xl_tot_len you read was
> recycled garbage bits, or it was legitimate but the overwrite of the
> following page didn't make it to disk; either way, we don't have a
> record, so we have an end-of-wal condition.  The xlp_rem_len check
> defends against the second page making it to disk while the first one
> still contains recycled garbage where the xl_tot_len should be*.
>
> What Michael wants to do now is remove the 2004-era assumption that
> malloc failure implies bogus data.  It must be pretty unlikely in a 64
> bit world with overcommitted virtual memory, but a legitimate
> xl_tot_len can falsely end recovery and lose data, as reported from a
> production case analysed by his colleagues.  In other words, we can
> actually distinguish between lack of resources and recycled bogus
> data, so why treat them the same?

Indeed.  Hard failure is fine, and ENOMEM=end-of-WAL definitely isn't.

> *A more detailed analysis would talk about sectors (page header is
> atomic)

I think the page header is atomic on POSIX-compliant filesystems but not
atomic on ext4.  That doesn't change the conclusion on $SUBJECT.
Re: Fail hard if xlogreader.c fails on out-of-memory
On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
> I don't have an opinion yet on your other thread about making this
> stuff configurable for replicas, but for the simple crash recovery
> case shown here, hard failure makes sense to me.

Also, if we conclude that we're OK with just failing hard all the time
on OOM for crash recovery and archive recovery, the other patch is not
really required.  That would be disruptive for standbys in some cases,
but still perhaps OK in the long term.

I am wondering if people have actually lost data because of this problem
on production systems.  It would not be possible to know that it
happened until you see a page on disk with an LSN that looks valid but
is older than the position currently being inserted, and that could show
up in various forms.  Even that could get hidden quickly if WAL is
written at a fast pace after crash recovery.  A standby promotion at an
older LSN would be unlikely, as monitoring solutions usually discard
standbys lagging behind by more than N bytes.

> *A more detailed analysis would talk about sectors (page header is
> atomic), and consider whether we're only trying to defend ourselves
> against recycled pages written by PostgreSQL (yes), arbitrary random
> data (no, but it's probably still pretty good) or someone trying to
> trick us (no, and we don't stand a chance).

WAL would not be the only part of the system that would get borked if
arbitrary bytes could be inserted into what's read from disk, random or
not.
--
Michael
Re: Fail hard if xlogreader.c fails on out-of-memory
On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier wrote:
> Thoughts and/or comments are welcome.

I don't have an opinion yet on your other thread about making this
stuff configurable for replicas, but for the simple crash recovery
case shown here, hard failure makes sense to me.

Here are some interesting points in the history of this topic:

1999 30659d43: xl_len is 16 bit, fixed size buffer in later commits
2001 7d4d5c00: WAL files recycled, xlp_pageaddr added
2004 0ffe11ab: xl_len is 32 bit, dynamic buffer, malloc() failure ends redo
2005 21fda22e: xl_tot_len and xl_len co-exist
2014 2c03216d: xl_tot_len fully replaces xl_len
2018 70b4f82a: xl_tot_len > 1GB ends redo
2023 8fcb32db: don't let xl_tot_len > 1GB be logged!
2023 bae868ca: check next xlp_pageaddr, xlp_rem_len before allocating

Recycled pages can't fool us into making a huge allocation any more.
If xl_tot_len implies more than one page but the next page's
xlp_pageaddr is too low, then either the xl_tot_len you read was
recycled garbage bits, or it was legitimate but the overwrite of the
following page didn't make it to disk; either way, we don't have a
record, so we have an end-of-wal condition.  The xlp_rem_len check
defends against the second page making it to disk while the first one
still contains recycled garbage where the xl_tot_len should be*.

What Michael wants to do now is remove the 2004-era assumption that
malloc failure implies bogus data.  It must be pretty unlikely in a 64
bit world with overcommitted virtual memory, but a legitimate
xl_tot_len can falsely end recovery and lose data, as reported from a
production case analysed by his colleagues.  In other words, we can
actually distinguish between lack of resources and recycled bogus
data, so why treat them the same?

For comparison, if you run out of disk space during recovery we don't
say "oh well, that's enough redoing for today, the computer is full,
let's forget about the rest of the WAL and start accepting new
transactions!".  The machine running recovery has certain resource
requirements relative to the machine that generated the WAL, and if
they're not met it just can't do it.  It's the same if various other
allocations fail.  The new situation is that we are now verifying that
xl_tot_len was actually logged by PostgreSQL, so if we can't allocate
space for it, we can't replay the WAL.

*A more detailed analysis would talk about sectors (page header is
atomic), and consider whether we're only trying to defend ourselves
against recycled pages written by PostgreSQL (yes), arbitrary random
data (no, but it's probably still pretty good) or someone trying to
trick us (no, and we don't stand a chance).
Fail hard if xlogreader.c fails on out-of-memory
Hi all,
(Thomas in CC.)

Now that becfbdd6c1c9 has improved the situation to detect the
difference between out-of-memory and invalid WAL data, I guess that it
is time to tackle the problem of what we should do when reading a WAL
record fails on out-of-memory.

To summarize, the WAL reader APIs currently fail the same way whether
we detect an incorrect WAL record or a memory allocation fails: an
error is generated and returned back to the caller to consume.

For WAL replay, not being able to tell the difference between an OOM
and the end-of-wal is a problem in some cases.  For example, in crash
recovery, a failed internal allocation will be detected as the
end-of-wal, causing recovery to stop prematurely.  In the worst cases,
this silently corrupts clusters because not all the records generated
in the local pg_wal/ have been replayed.  Oops.

When in standby mode, things are a bit better, because we'd just loop
and wait for the next record.  But even in this case, if the startup
process does crash recovery while standby mode is set up, we may finish
by attempting recovery from a different source than the local pg_wal/.
Not strictly critical, but less optimal in some cases, as we could
switch to archive recovery earlier than necessary.

In a different thread, I have proposed to extend the WAL reader
facility so that an error code is returned, making it possible to tell
apart an OOM and the end of WAL on an incorrect record:
https://www.postgresql.org/message-id/ZRJ-p1dLUY0uoChc%40paquier.xyz

However this requires some ABI changes, so that's not backpatchable.
This leaves open what we can do for the existing back-branches, and one
option is the simplest thing I can think of: if an allocation fails,
just fail *hard*.  The allocations of the WAL reader rely on
palloc_extended(), so I'd like to suggest that we switch to palloc()
instead.
If we do so, an ERROR is promoted to a FATAL during WAL replay, which
makes sure that we never stop recovery earlier than we should,
FATAL-ing before things go wrong.  Note that WAL prefetching already
relies on a palloc() on HEAD in XLogReadRecordAlloc(), which would fail
hard the same way on OOM.

So, attached is a proposal of patch to do something down to 12.

Thoughts and/or comments are welcome.
--
Michael

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a17263df20..a1363e3b8f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -43,7 +43,7 @@
 static void report_invalid_record(XLogReaderState *state,
								  const char *fmt,...)
			pg_attribute_printf(2, 3);
-static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
+static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
@@ -155,14 +155,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
	 * Allocate an initial readRecordBuf of minimal size, which can later be
	 * enlarged if necessary.
	 */
-	if (!allocate_recordbuf(state, 0))
-	{
-		pfree(state->errormsg_buf);
-		pfree(state->readBuf);
-		pfree(state);
-		return NULL;
-	}
-
+	allocate_recordbuf(state, 0);
 	return state;
 }
@@ -184,7 +177,6 @@ XLogReaderFree(XLogReaderState *state)
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
- * Returns true if successful, false if out of memory.
  *
  * readRecordBufSize is set to the new buffer size.
  *
@@ -196,7 +188,7 @@ XLogReaderFree(XLogReaderState *state)
  * Note: This routine should *never* be called for xl_tot_len until the header
  * of the record has been fully validated.
  */
-static bool
+static void
 allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 {
	uint32		newSize = reclength;
@@ -206,15 +198,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
-	state->readRecordBuf =
-		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
-	if (state->readRecordBuf == NULL)
-	{
-		state->readRecordBufSize = 0;
-		return false;
-	}
+	state->readRecordBuf = (char *) palloc(newSize);
 	state->readRecordBufSize = newSize;
-	return true;
 }
 /*
@@ -505,9 +490,7 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
	if (allow_oversized)
	{
-		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
-		if (decoded == NULL)
-			return NULL;
+		decoded = palloc(required_space);
		decoded->oversized = true;
		return decoded;
	}
@@ -815,13 +798,7 @@ restart:
			Assert(gotlen <= lengthof(save_copy));
			Assert(gotlen <= state->readRecordBufSize);
			memcpy(save_copy, state->readRecordBuf, gotlen);
-			if (!allocate_recordbuf(state, total_len))
-			{
-				/* We treat this as