Re: [HACKERS] silent data loss with ext4 / all current versions

Tomas Vondra Fri, 22 Jan 2016 00:27:07 -0800

Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsync. By killing-9 the VM I am able to reproduce that
really easily.


Yep. Same experience here (with qemu-kvm VMs).

Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, the rename happening
only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, orcould there be some other platforms? And are we sure the file systems onthose platforms are safe without the fsync call?

That is, while the report references ext4, there may be other filesystems with the same problem - ext4 was used mostly as it's the mostwidely used Linux file system.

3) In exitArchiveRecovery if the rename is not made durable I think
it does not matter much. Even if recovery.conf is the one present
once the node restarts node would move back again to recovery, and
actually we had better move back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy toexchange the one fsync for not having to manually fix the database aftercrash.

I don't really see why switching back to recovery should be desirable inthis case? Imagine you have a warm/hot standby, and that you promote itto master. The clients connect, start issuing commands and then thesystem crashes and loses the recovery.conf rename. The system reboots,database performs local recovery but then starts as a standby and startsrejecting writes. That seems really weird to me.

4) StartupXLOG for the tablespace map. I don't think this one is
needed as well. Even if the tablespace map is not removed after a
power loss user would get an error telling that the file should be
removed.


Please no, for the same reasons as in (3).

5) For the one where haveBackupLabel || haveTblspcMap. If we do the
fsync, we guarantee that there is no need to do again the recovery.
But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don'thave to? Had this not be reliable, then the whole recovery process isfundamentally broken and we better fix it instead of merely putting aband-aid on it.

6) For the one after XLogArchiveNotify() for the last partial
segment of the old timeline, it doesn't really matter to not make the
change persistent as this is mainly done because it is useful to
identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not todo the fsync. It's not really going to give us any measurableperformance advantage (how often we do those fsyncs), so I'd vote tokeep it and make sure the partial segments are named accordingly. Lessconfusion is always better.

7) In CancelBackup, this one is not needed as well I think. We would
surely want to get back to recovery if those files remain after a
power loss.

I may be missing something, but why would we switch to recovery in thiscase?


For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename
was not effective.

So we'll lose the segment (won't have it locally under the permanentname), as we've already restored it and won't do that again. Is thatreally a good thing to do?

2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it though as an optimization to not have to archive again
this file.


Yes.


For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has been already
archived, preventing a retry.


Not sure what you mean by "could add it as an optimization"?

In timeline.c:
1) writeTimeLineHistoryFile, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).

OK

2) In writeTimeLineHistoryFile, similarly we don't need to care
much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
the same path

OK


Thoughts?

Thanks for the review and comments. I think the question is whether weonly want to do the additional fsync() only when it ultimately may leadto data loss, or even in cases where it may cause operational issues(e.g. switching back to recovery needlessly).

I'd vote for the latter, as I think it makes the database easier tooperate (less manual interventions) and the performance impact is 0 (asthose fsyncs are really rare).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] silent data loss with ext4 / all current versions

Reply via email to