Re: Sketch of a fix for that truncation data corruption issue

Tom Lane Mon, 10 Dec 2018 22:06:51 -0800

Robert Haas <[email protected]> writes:
> On Tue, Dec 11, 2018 at 5:39 AM Tom Lane <[email protected]> wrote:
>> 9. If actual truncation boundary was different from plan, issue another
>> WAL record saying "oh, we only managed to truncate to here, not there".


> I don't entirely understand how this fix addresses the problems in
> this area,

Well, the point is to not fail if an ftruncate() call fails.  The hard
part, of course, is to adequately maintain/restore consistency when
that happens.

> ... but this step sounds particularly scary.  Nothing
> guarantees that the second WAL record ever gets replayed.

I'm not following?  How would a slave not replay that record, other
than by diverging to a new timeline?  (In that case it's okay if it
doesn't have exactly the master's state.)

>> * "Only managed to truncate to here" record: write out empty heap
>> pages to fill the space from original truncation target to actual.
>> This restores the on-disk situation to be equivalent to what it
>> was in master, assuming all the dirty pages eventually got written.

> This is equivalent only in a fairly loose sense, right?

Right, specifically in the sense that logically empty pages (containing no
live tuples) get replaced by physically empty pages.  We sort of do that
now when we truncate: the truncated-away pages may not be physically
empty, but whenever we next extend the relation, we'll materialize a new
physically empty page where that page had been.

There are at least two variants of the idea that seem worth studying:
one is to fill the not-successfully-truncated space with zeroes rather
than valid empty pages, and the other is to not re-extend the relation
at all, but just proceed as though the original truncation had
succeeded fully.
My concern about the latter is mostly that a slave following the WAL
stream might see commands to write pages that are not contiguous with
what it thinks the file EOF is, and that could lead to either bogus errors
or weird situations with "holes" in files.  Maybe we could make that
work, though.  The fill-with-zeroes idea is sort of a compromise
between the other two, and could be better or worse depending on code
details that I've not really looked into yet.  But it'd make this
situation look much like the case where we crash between smgrextend'ing
a rel and writing a valid page into the space, which works AFAIK.
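
As a rough illustration of the fill-with-zeroes variant only, here is a
minimal sketch in Python rather than backend C.  It assumes the replaying
server has already truncated its copy to the planned target and then sees
a record saying the truncation actually stopped at a later block.  The
function names, the record contents, and the file handling are all
hypothetical; only the 8192-byte default block size is taken from
PostgreSQL, and none of this is actual smgr or WAL replay code.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size; all names below are hypothetical


def replay_partial_truncate(path, planned_blocks, actual_blocks):
    """Zero-fill blocks [planned_blocks, actual_blocks) of the file at path.

    Assumes the file was already truncated to planned_blocks, so the
    write starts exactly at EOF and leaves no hole.
    Returns the resulting file length in blocks.
    """
    assert actual_blocks >= planned_blocks
    with open(path, "r+b") as f:
        f.seek(planned_blocks * BLCKSZ)
        f.write(bytes(BLCKSZ) * (actual_blocks - planned_blocks))
    return os.path.getsize(path) // BLCKSZ


def demo():
    """Simulate: 10-block file, planned truncation to 4, actual truncation to 7."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "rel")
        # A 10-block "relation" filled with non-zero bytes.
        with open(path, "wb") as f:
            f.write(b"\x01" * (10 * BLCKSZ))
        # Replay of the planned truncation to 4 blocks.
        os.truncate(path, 4 * BLCKSZ)
        # Replay of the hypothetical "only managed to truncate to 7" record.
        nblocks = replay_partial_truncate(path, 4, 7)
        with open(path, "rb") as f:
            data = f.read()
        kept_intact = data[: 4 * BLCKSZ] == b"\x01" * (4 * BLCKSZ)
        tail_zeroed = data[4 * BLCKSZ:] == bytes(3 * BLCKSZ)
        return nblocks, kept_intact, tail_zeroed
```

In this sketch the file ends up the same length as on the server that
failed to truncate fully, with the surviving blocks untouched and the
re-extended range all zeroes, which is the same shape as a file that was
extended but never had valid pages written into the new space.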

Anyway, if your assumption is that WAL replay must yield bit-for-bit
the same state of the not-truncated pages that the master would have,
then I doubt we can make this work.  In that case we're back to the
type of solution you rejected eight years ago, where we have to write
out pages before truncating them away.

                        regards, tom lane
