Robert Haas <robertmh...@gmail.com> writes:
> On Tue, Dec 11, 2018 at 5:39 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> 9. If actual truncation boundary was different from plan, issue another
>> WAL record saying "oh, we only managed to truncate to here, not there".
> I don't entirely understand how this fix addresses the problems in
> this area,

Well, the point is to not fail if an ftruncate() call fails.
The hard part, of course, is to adequately maintain/restore consistency
when that happens.

> ... but this step sounds particularly scary.  Nothing
> guarantees that the second WAL record ever gets replayed.

I'm not following?  How would a slave not replay that record, other than
by diverging to a new timeline?  (in which case it's okay if it doesn't
have exactly the master's state)

>> * "Only managed to truncate to here" record: write out empty heap
>> pages to fill the space from original truncation target to actual.
>> This restores the on-disk situation to be equivalent to what it
>> was in master, assuming all the dirty pages eventually got written.

> This is equivalent only in a fairly loose sense, right?

Right, specifically in the sense that logically empty pages (containing
no live tuples) get replaced by physically empty pages.  We sort of do
that now when we truncate: the truncated-away pages may not be physically
empty, but whenever we next extend the relation, we'll materialize a
new physically empty page where that page had been.

There are at least two variants of the idea that seem worth studying:
one is to fill the not-successfully-truncated space with zeroes not valid
empty pages, and the other is to not re-extend the relation at all, but
just proceed as though the original truncation had succeeded fully.
My concern about the latter is mostly that a slave following the WAL
stream might see commands to write pages that are not contiguous with
what it thinks the file EOF is, and that could lead to either bogus
errors or weird situations with "holes" in files.  Maybe we could make
that work, though.

The fill-with-zeroes idea is sort of a compromise in between the other
two, and could be better or worse depending on code details that I've
not really looked into yet.
But it'd make this situation look much like the case where we crash
between smgrextend'ing a rel and writing a valid page into the space,
which works AFAIK.

Anyway, if your assumption is that WAL replay must yield bit-for-bit
the same state of the not-truncated pages that the master would have,
then I doubt we can make this work.  In that case we're back to the
type of solution you rejected eight years ago, where we have to write
out pages before truncating them away.

			regards, tom lane
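
To make the three variants discussed above concrete, here is a toy model
of how a standby might replay the proposed "only managed to truncate to
here" record.  It is only a sketch under loud assumptions: the relation
is an in-memory list of pages, EMPTY_PAGE is a made-up stand-in for an
initialized page with no tuples (not the real page header layout), and
the function names and the "fill" parameter are hypothetical, not
anything in the PostgreSQL source.

```python
PAGE_SIZE = 8192  # PostgreSQL's default block size, used here for illustration

# An all-zeroes page, as left behind by extending a file without initializing it.
ZERO_PAGE = bytes(PAGE_SIZE)

# Hypothetical stand-in for a valid, initialized page holding no tuples.
# The 3-byte marker is NOT the real page header format.
EMPTY_PAGE = b"HDR" + bytes(PAGE_SIZE - 3)


def replay_truncate(pages, target):
    """Replay the planned truncation: keep only the first `target` pages."""
    del pages[target:]


def replay_partial_truncate(pages, target, actual, fill="empty"):
    """Replay a hypothetical 'only truncated to `actual`, not `target`' record.

    fill="empty": re-extend with valid empty pages (the original proposal).
    fill="zero":  re-extend with all-zeroes pages (the compromise variant).
    fill="none":  do not re-extend; act as if truncation fully succeeded.
    """
    if fill not in ("empty", "zero", "none"):
        raise ValueError("unknown fill mode: %r" % (fill,))
    if actual < target:
        raise ValueError("actual boundary cannot be below the planned target")
    if len(pages) != target:
        raise ValueError("planned truncation must be replayed first")
    if fill == "none":
        return
    filler = EMPTY_PAGE if fill == "empty" else ZERO_PAGE
    # Fill the space from the planned target up to the actual boundary.
    pages.extend([filler] * (actual - target))
```

With a 10-page relation, a planned target of 4 and an actual boundary
of 7, the "empty" and "zero" modes both leave the standby with 7 pages,
differing only in what blocks 4..6 contain, while "none" leaves it at 4
pages.  The last case is the one where later WAL records touching blocks
beyond the standby's idea of EOF would have to be dealt with.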