On Sun, 8 May 2005, Tom Lane wrote:

> While your original patch is buggy, it's at least fixable and has
> localized, limited impact.  I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.

I'm going to try this some more, because I feel that a scheme like this that doesn't rely on scanning pg_class and the file system would in fact be safer.


The key is to A) obey the "WAL first" rule, and B) remember information about file creations over a checkpoint. The problem with my previous suggestion was that it didn't reliably accomplish either :).

Right now we break the WAL rule because the file creation is recorded after the file is created, and the record is not flushed.

The trivial way to fix that is to write and flush the xlog record before actually creating the file. (For a more optimized way to do it, see the end of this message.) Then we could trust that there aren't any files in the data directory that don't have a corresponding record in WAL.
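
To make the ordering concrete, here is a rough sketch in Python rather than backend C. The log file, the record format and the function name are all made up for illustration; the only point is that the record is appended and fsynced before the file comes into existence.

```python
import os


def create_file_wal_first(log_path, data_path, xid):
    """Log the file creation durably, then create the file (illustrative only)."""
    record = f"CREATE {xid} {data_path}\n".encode()

    # 1. Append the creation record to the log and force it to disk.
    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(log_fd, record)
        os.fsync(log_fd)
    finally:
        os.close(log_fd)

    # 2. Only now create the data file. O_EXCL makes this fail if the
    #    file already exists instead of silently reusing it.
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.close(data_fd)
```

If we crash between steps 1 and 2, the log mentions a file that doesn't exist, which is harmless. The reverse order is what leaves orphaned files behind.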

But that's not enough. If a checkpoint occurs after the file is created but before the transaction ends, WAL replay, which starts from the checkpoint, never sees the file creation record. That's why we need a mechanism to carry the information over the checkpoint.

We could do that by extending the ForwardFsyncRequest function or by creating something similar to that. When a backend writes the file creation WAL record, it also sends a message to the bgwriter that says "I'm xid 1234, and I have just created file foobar/1234" (while holding CheckpointStartLock). Bgwriter keeps a list of xid/file pairs like it keeps a list of pending fsync operations. On checkpoint, the checkpointer scans the list and removes entries for transactions that have already ended, and attaches the remaining list to the checkpoint record.
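
The bookkeeping on the bgwriter side could look roughly like this. It's a Python sketch, not a patch: the class and method names are invented, and is_in_progress stands in for whatever check the real code would use to tell whether an xid is still running.

```python
class PendingCreations:
    """xid/file pairs reported by backends (hypothetical structure)."""

    def __init__(self):
        self.entries = []  # list of (xid, path) tuples, in arrival order

    def remember(self, xid, path):
        # Called when a backend reports "xid has just created path".
        self.entries.append((xid, path))

    def checkpoint(self, is_in_progress):
        # Drop entries whose transaction has already ended, and return
        # the survivors: that is the list attached to the checkpoint record.
        self.entries = [(xid, path) for (xid, path) in self.entries
                        if is_in_progress(xid)]
        return list(self.entries)
```

For example, if xids 1234 and 1235 have each created a file and only 1235 is still running at checkpoint time, the checkpoint record carries just the entry for 1235.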


WAL replay would start with the xid/file list in the checkpoint record, and update it during the replay whenever a file creation or a transaction commit/rollback record is seen. On a rollback record, files created by that transaction are deleted. At the end of WAL replay, the files that are left in the list belong to transactions that implicitly aborted, and can be deleted.
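
Here is the replay logic as a small Python simulation. The tuple-based record format is invented for the example and has nothing to do with the real xlog record layout; the function only computes which files replay would delete.

```python
def files_to_delete_after_replay(checkpoint_list, records):
    """Return the files that replay would delete, in order.

    checkpoint_list: (xid, path) pairs taken from the checkpoint record.
    records: ("create", xid, path), ("commit", xid) or ("abort", xid).
    """
    pending = {}  # xid -> paths created by a transaction with no outcome yet
    for xid, path in checkpoint_list:
        pending.setdefault(xid, []).append(path)

    doomed = []
    for record in records:
        kind, xid = record[0], record[1]
        if kind == "create":
            pending.setdefault(xid, []).append(record[2])
        elif kind == "commit":
            pending.pop(xid, None)              # the files are kept
        elif kind == "abort":
            doomed.extend(pending.pop(xid, []))  # explicit rollback

    # Whatever is left belongs to transactions that never logged an
    # outcome, i.e. they aborted implicitly in the crash.
    for paths in pending.values():
        doomed.extend(paths)
    return doomed
```

With a checkpoint list of [(1234, "foobar/1234")] and a log in which 1235 creates a file and commits while 1236 creates a file and aborts, the result is the files of 1236 and 1234; the file of 1235 survives.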

If we don't want to extend the checkpoint record, a separate WAL record works too.

Now, the more optimized way to do A:

Delay the actual file creation until it's first written to. The write needs to be WAL logged anyway, so we would just piggyback on that.

Implemented this way, I don't think there would be a significant performance hit from the scheme. We would create more ForwardFsyncRequest traffic, but not much compared to the block fsync requests we have right now.

BTW: If we allowed mdopen to create the file if it doesn't exist already, would we need the current file creation xlog record for anything? (I'm not suggesting we do that, just trying to get more insight.)

- Heikki
