I had a look at the UNDO patches at
https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com,
and at the patch to use the UNDO logs to clean up orphaned files, from
undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to
review?
Thanks Thomas and Amit and others for working on this! Orphaned relfiles
has been an ugly wart forever. It's a small thing, but really nice to
fix that finally. This has been a long thread, and I haven't read it
all, so please forgive me if I repeat stuff that's already been discussed.
There are similar issues in CREATE/DROP DATABASE code. If you crash in
the middle of CREATE DATABASE, you can be left with orphaned files in
the data directory, or if you crash in the middle of DROP DATABASE, the
data might be gone already but the pg_database entry is still there. We
should plug those holes too.
There's a lot of stuff in the patches that are not relevant for cleaning
up orphaned files. I know this cleaning up orphaned files work is mainly
a vehicle to get the UNDO log committed, so that's expected. If we only
cared about orphaned files, I'm sure the patches wouldn't spend so much
effort on concurrency, for example. Nevertheless, I think we should
leave out some stuff that's clearly unused, for now. For example, a
bunch of fields in the record format: uur_block, uur_offset, uur_tuple.
You can add them later, as part of the patches that actually need them,
but for now they just make the patch larger to review.
Some more thoughts on the record format:
I feel that the level of abstraction is not quite right. There are a
bunch of fields, like uur_block, uur_offset, uur_tuple, that are
probably useful for some UNDO resource managers (zheap I presume), but
seem kind of arbitrary. How is uur_tuple different from uur_payload?
Should they be named more generically as uur_payload1 and uur_payload2?
And why two, why not three or four different payloads? In the WAL record
format, there's a concept of "block id", which allows you to store N
number of different payloads in the record, I think that would be a
better approach. Or only have one payload, and let the resource manager
code divide it as it sees fit.
Many of the fields support a primitive type of compression, where a
field can be omitted if it has the same value as on the first record on
an UNDO page. That's handy. But again I don't like the fact that the
fields have been hard-coded into the UNDO record format. I can see e.g.
the relation oid to be useful for many AMs. But not all. And other AMs
might well want to store and deduplicate other things, aside from the
fields that are in the patch now. I'd like to move most of the fields to
AM specific code, and somehow generalize the compression. One approach
would be to let the AM store an arbitrary struct, and run it through a
general-purpose compression algorithm, using the UNDO page's first
record as the "dictionary". Or make the UNDO page's first record
available in whole to the AM specific code, and let the AM do the
deduplication. For cleaning up orphaned files, though, we don't really
care about any of that, so I'd recommend just ripping it out for now.
Compression/deduplication can be added later as a separate patch.
The orphaned-file cleanup patch doesn't actually use the uur_reloid
field. It stores the RelFileNode instead, in the paylod. I think that's
further evidence that the hard-coded fields in the record format are not
quite right.
I don't like the way UndoFetchRecord returns a palloc'd
UnpackedUndoRecord. I would prefer something similar to the xlogreader
API, where a new call to UndoFetchRecord invalidates the previous
result. On efficiency grounds, to avoid the palloc, but also to be
consistent with xlogreader.
In the UNDO page header, there are a bunch of fields like
pd_lower/pd_upper/pd_special that are copied from the "standard" page
header, that are unused. There's a FIXME comment about that too. Let's
remove them, there's no need for UNDO pages to look like standard
relation pages. The LSN needs to be at the beginning, to work with the
buffer manager, but that's the only requirement.
Could we leave out the UNDO and discard worker processes for now?
Execute all UNDO actions immediately at rollback, and after crash
recovery. That would be fine for cleaning up orphaned files, and it
would cut down the size of the patch to review.
Can this race condition happen: Transaction A creates a table and an
UNDO record to remember it. The transaction is rolled back, and the file
is removed. Another transaction, B, creates a different table, and
chooses the same relfilenode. It loads the table with data, and commits.
Then the system crashes. After crash recovery, the UNDO record for the
first transaction is applied, and it removes the file that belongs to
the second table, created by transaction B.
- Heikki