Re: POC: Cleaning up orphaned files using undo logs

Heikki Linnakangas Sun, 04 Aug 2019 02:17:51 -0700

I had a look at the UNDO patches at https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, and at the patch to use the UNDO logs to clean up orphaned files, from undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to review?

Thanks Thomas and Amit and others for working on this! Orphaned relfiles have been an ugly wart forever. It's a small thing, but really nice to fix that finally. This has been a long thread, and I haven't read it all, so please forgive me if I repeat stuff that's already been discussed.

There are similar issues in the CREATE/DROP DATABASE code. If you crash in the middle of CREATE DATABASE, you can be left with orphaned files in the data directory, or if you crash in the middle of DROP DATABASE, the data might be gone already but the pg_database entry is still there. We should plug those holes too.

There's a lot of stuff in the patches that is not relevant for cleaning up orphaned files. I know this orphaned-file cleanup work is mainly a vehicle to get the UNDO log committed, so that's expected. If we only cared about orphaned files, I'm sure the patches wouldn't spend so much effort on concurrency, for example. Nevertheless, I think we should leave out some stuff that's clearly unused, for now. For example, a bunch of fields in the record format: uur_block, uur_offset, uur_tuple. You can add them later, as part of the patches that actually need them, but for now they just make the patch larger to review.


Some more thoughts on the record format:

I feel that the level of abstraction is not quite right. There are a bunch of fields, like uur_block, uur_offset, uur_tuple, that are probably useful for some UNDO resource managers (zheap I presume), but seem kind of arbitrary. How is uur_tuple different from uur_payload? Should they be named more generically as uur_payload1 and uur_payload2? And why two, why not three or four different payloads? In the WAL record format, there's a concept of "block id", which allows you to store N different payloads in the record. I think that would be a better approach. Or only have one payload, and let the resource manager code divide it as it sees fit.
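To make the "N payloads keyed by an id" idea concrete, here is a minimal sketch. Everything in it is hypothetical (the UndoChunk name, the 1-byte id plus 2-byte length framing, the function names). It is not the WAL record format and not anything in the patch, just an illustration of a record body that carries any number of id-tagged payloads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical framing, repeated for each payload in a record body:
 *   [id: 1 byte][len: 2 bytes, native byte order][data: len bytes]
 */
#define UNDO_CHUNK_HDR 3

typedef struct UndoChunk
{
    uint8_t        id;    /* caller-defined payload id */
    uint16_t       len;   /* payload length in bytes */
    const uint8_t *data;  /* points into the record body */
} UndoChunk;

/*
 * Append one payload. Returns the new used size, or 0 if it does not fit
 * (a successful append always returns at least UNDO_CHUNK_HDR).
 */
static size_t
undo_chunk_append(uint8_t *buf, size_t used, size_t cap,
                  uint8_t id, const void *data, uint16_t len)
{
    if (used > cap || cap - used < UNDO_CHUNK_HDR + (size_t) len)
        return 0;
    buf[used] = id;
    memcpy(buf + used + 1, &len, sizeof(len));
    if (len > 0)
        memcpy(buf + used + UNDO_CHUNK_HDR, data, len);
    return used + UNDO_CHUNK_HDR + len;
}

/* Find the first payload with the given id. Returns 1 if found, else 0. */
static int
undo_chunk_find(const uint8_t *buf, size_t used, uint8_t id, UndoChunk *out)
{
    size_t pos = 0;

    while (used - pos >= UNDO_CHUNK_HDR)
    {
        uint16_t len;

        memcpy(&len, buf + pos + 1, sizeof(len));
        if (used - pos - UNDO_CHUNK_HDR < len)
            return 0;           /* truncated record body */
        if (buf[pos] == id)
        {
            out->id = id;
            out->len = len;
            out->data = buf + pos + UNDO_CHUNK_HDR;
            return 1;
        }
        pos += UNDO_CHUNK_HDR + len;
    }
    return 0;
}
```

With something like this, the resource manager decides what each id means, and the generic record format doesn't need to know about blocks, offsets or tuples at all.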

Many of the fields support a primitive type of compression, where a field can be omitted if it has the same value as on the first record on an UNDO page. That's handy. But again I don't like the fact that the fields have been hard-coded into the UNDO record format. I can see e.g. the relation oid being useful for many AMs. But not all. And other AMs might well want to store and deduplicate other things, aside from the fields that are in the patch now. I'd like to move most of the fields to AM-specific code, and somehow generalize the compression. One approach would be to let the AM store an arbitrary struct, and run it through a general-purpose compression algorithm, using the UNDO page's first record as the "dictionary". Or make the UNDO page's first record available in whole to the AM-specific code, and let the AM do the deduplication. For cleaning up orphaned files, though, we don't really care about any of that, so I'd recommend just ripping it out for now. Compression/deduplication can be added later as a separate patch.
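As a rough sketch of the second option (hand the page's first record to the AM and let it deduplicate), assume a hypothetical AM whose records have just two fields. The struct, flag names and functions below are all invented for illustration; the point is only that the "omit if same as first record" logic can live entirely in AM code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical AM-private record; not a struct from the patch. */
typedef struct DemoAmRecord
{
    uint32_t reloid;
    uint32_t blkno;
} DemoAmRecord;

#define DEMO_HAS_RELOID 0x01
#define DEMO_HAS_BLKNO  0x02
#define DEMO_MAX_ENCODED 9      /* 1 flag byte + two 4-byte fields */

/*
 * Encode rec, omitting fields equal to the page's first record.
 * Pass first == NULL when rec itself is the first record on the page.
 * out must have room for DEMO_MAX_ENCODED bytes. Returns bytes written.
 */
static size_t
demo_am_encode(const DemoAmRecord *first, const DemoAmRecord *rec,
               uint8_t *out)
{
    uint8_t flags = 0;
    size_t  pos = 1;            /* byte 0 is reserved for the flags */

    if (first == NULL || rec->reloid != first->reloid)
    {
        flags |= DEMO_HAS_RELOID;
        memcpy(out + pos, &rec->reloid, sizeof(rec->reloid));
        pos += sizeof(rec->reloid);
    }
    if (first == NULL || rec->blkno != first->blkno)
    {
        flags |= DEMO_HAS_BLKNO;
        memcpy(out + pos, &rec->blkno, sizeof(rec->blkno));
        pos += sizeof(rec->blkno);
    }
    out[0] = flags;
    return pos;
}

/*
 * Decode a record, filling omitted fields from the page's first record.
 * first may be NULL only if every field is present in the input.
 */
static void
demo_am_decode(const DemoAmRecord *first, const uint8_t *in,
               DemoAmRecord *rec)
{
    uint8_t flags = in[0];
    size_t  pos = 1;

    if (flags & DEMO_HAS_RELOID)
    {
        memcpy(&rec->reloid, in + pos, sizeof(rec->reloid));
        pos += sizeof(rec->reloid);
    }
    else
        rec->reloid = first->reloid;

    if (flags & DEMO_HAS_BLKNO)
        memcpy(&rec->blkno, in + pos, sizeof(rec->blkno));
    else
        rec->blkno = first->blkno;
}
```

The generic UNDO layer would then only need to locate the first record on the page and pass it along; which fields exist and which are worth deduplicating is up to each AM.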

The orphaned-file cleanup patch doesn't actually use the uur_reloid field. It stores the RelFileNode instead, in the payload. I think that's further evidence that the hard-coded fields in the record format are not quite right.

I don't like the way UndoFetchRecord returns a palloc'd UnpackedUndoRecord. I would prefer something similar to the xlogreader API, where a new call to UndoFetchRecord invalidates the previous result. Partly on efficiency grounds, to avoid the palloc, but also to be consistent with xlogreader.
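Something along these lines is what I have in mind. This is a standalone toy, not PostgreSQL code: the DemoUndoReader type, the function names and the in-memory "log" with a 2-byte length prefix per record are all made up. The reader owns one reusable buffer, and each fetch hands back a pointer that is only valid until the next fetch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoded record; payload points into the reader's buffer. */
typedef struct DemoUndoRecord
{
    uint16_t       len;
    const uint8_t *payload;
} DemoUndoRecord;

/* Hypothetical reader state, loosely modelled on the xlogreader idea. */
typedef struct DemoUndoReader
{
    const uint8_t *log;      /* toy in-memory undo log */
    size_t         log_len;
    uint8_t       *buf;      /* reusable decode buffer, owned by reader */
    size_t         buf_cap;
    DemoUndoRecord rec;      /* the one result slot, reused on each fetch */
} DemoUndoReader;

static void
demo_reader_init(DemoUndoReader *r, const uint8_t *log, size_t log_len)
{
    memset(r, 0, sizeof(*r));
    r->log = log;
    r->log_len = log_len;
}

/*
 * Fetch the record at the given offset. Each record in the toy log is a
 * 2-byte length (native byte order) followed by that many payload bytes.
 * The returned pointer and its payload are invalidated by the next call.
 * Returns NULL if the offset is out of range or allocation fails.
 */
static const DemoUndoRecord *
demo_reader_fetch(DemoUndoReader *r, size_t offset)
{
    uint16_t len;

    if (offset > r->log_len || r->log_len - offset < sizeof(len))
        return NULL;
    memcpy(&len, r->log + offset, sizeof(len));
    if (r->log_len - offset - sizeof(len) < len)
        return NULL;

    if (len > r->buf_cap)
    {
        /* Grow the buffer only when needed; it is kept across calls. */
        uint8_t *newbuf = realloc(r->buf, len);

        if (newbuf == NULL)
            return NULL;
        r->buf = newbuf;
        r->buf_cap = len;
    }
    if (len > 0)
        memcpy(r->buf, r->log + offset + sizeof(len), len);

    r->rec.len = len;
    r->rec.payload = r->buf;
    return &r->rec;
}

static void
demo_reader_free(DemoUndoReader *r)
{
    free(r->buf);
    r->buf = NULL;
    r->buf_cap = 0;
}
```

A caller that needs to hold on to a record across fetches copies it explicitly, so the common "look at one record at a time" loop allocates at most a few times rather than once per record.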

In the UNDO page header, there are a bunch of fields like pd_lower/pd_upper/pd_special that are copied from the "standard" page header, that are unused. There's a FIXME comment about that too. Let's remove them, there's no need for UNDO pages to look like standard relation pages. The LSN needs to be at the beginning, to work with the buffer manager, but that's the only requirement.

Could we leave out the UNDO and discard worker processes for now? Execute all UNDO actions immediately at rollback, and after crash recovery. That would be fine for cleaning up orphaned files, and it would cut down the size of the patch to review.

Can this race condition happen: Transaction A creates a table and an UNDO record to remember it. The transaction is rolled back, and the file is removed. Another transaction, B, creates a different table, and chooses the same relfilenode. It loads the table with data, and commits. Then the system crashes. After crash recovery, the UNDO record for the first transaction is applied, and it removes the file that belongs to the second table, created by transaction B.
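To spell out the sequence I'm worried about, here is a toy model. It makes no claim about what the patch actually does; the types, the "applied" flag and the function names are all invented. It only shows that if the fact "this UNDO record was already applied" does not survive the crash, replaying the record removes whatever file currently has that relfilenode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_FILES 8

/* Toy model of files on disk, keyed by relfilenode. */
typedef struct DemoFile
{
    uint32_t relfilenode;
    int      owner_xact;   /* which transaction created it */
    bool     exists;
} DemoFile;

typedef struct DemoStorage
{
    DemoFile files[DEMO_MAX_FILES];
    int      nfiles;
} DemoStorage;

/* Hypothetical UNDO record: "remove this relfilenode on rollback". */
typedef struct DemoUndoRecord
{
    uint32_t relfilenode;
    bool     applied;      /* assumption: some marker that it already ran */
} DemoUndoRecord;

static void
demo_create_file(DemoStorage *s, uint32_t node, int xact)
{
    assert(s->nfiles < DEMO_MAX_FILES);
    s->files[s->nfiles++] = (DemoFile) {node, xact, true};
}

/* Returns the creating transaction of the live file, or -1 if none. */
static int
demo_owner_of(const DemoStorage *s, uint32_t node)
{
    for (int i = 0; i < s->nfiles; i++)
        if (s->files[i].exists && s->files[i].relfilenode == node)
            return s->files[i].owner_xact;
    return -1;
}

/* Apply the UNDO action: remove any live file with that relfilenode. */
static void
demo_apply_undo(DemoStorage *s, DemoUndoRecord *rec)
{
    if (rec->applied)
        return;
    for (int i = 0; i < s->nfiles; i++)
        if (s->files[i].exists &&
            s->files[i].relfilenode == rec->relfilenode)
            s->files[i].exists = false;
    rec->applied = true;
}
```

Walking through it: A creates relfilenode 16384 and rolls back (the file is removed), B creates a file with the same relfilenode and commits, then a crash resets "applied" to false. Applying the record again leaves demo_owner_of() returning -1, i.e. B's file is gone. If "applied" had survived the crash, the second call would be a no-op and B's file would still be there.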


- Heikki
