On 12/12/2015 11:39 PM, Andres Freund wrote:
On 2015-12-12 23:28:33 +0100, Tomas Vondra wrote:
On 12/12/2015 11:20 PM, Andres Freund wrote:
On 2015-12-12 22:14:13 +0100, Tomas Vondra wrote:
this is the second improvement proposed in the thread [1] about the ext4 data
loss issue. It adds another field to the control file, tracking the last known
WAL segment. This does not eliminate the data loss, just the silent part of
it when the last segment gets lost (due to forgetting the rename, deleting
it by mistake or whatever). The patch makes sure the cluster refuses to
start if that happens.
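
To make the idea concrete, here is a minimal sketch of the mechanism described above. It is a toy model in Python, not the patch itself: the file layout and the names record_last_segment, check_last_segment and MissingWalSegment are hypothetical, and the real control file is a binary structure rather than a one-line text file.

```python
import os


class MissingWalSegment(Exception):
    """Raised when the recorded WAL segment is not present on disk."""


def record_last_segment(control_path, segment_name):
    # Hypothetical stand-in for the control file update: overwrite the
    # recorded segment name and flush it with a single fsync.
    with open(control_path, "w") as f:
        f.write(segment_name)
        f.flush()
        os.fsync(f.fileno())


def check_last_segment(control_path, wal_dir):
    # Startup check: refuse to continue if the segment named in the
    # control file has gone missing from the WAL directory.
    with open(control_path) as f:
        recorded = f.read().strip()
    if not os.path.exists(os.path.join(wal_dir, recorded)):
        raise MissingWalSegment(
            "last known WAL segment %s is missing" % recorded)
    return recorded
```

The sketch issues one fsync per recorded segment, and the startup check only turns a silent loss into a loud one; it does not recover anything.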

Uh, that's fairly expensive. In many cases it'll significantly
increase the number of fsyncs.

It should do exactly 1 additional fsync per WAL segment. Or do you think
otherwise?

Which is nearly doubling the number of fsyncs, for a good number of
workloads. And it does so to a separate file, i.e. it's not like
these writes and the flushes can be combined. In workloads where
pg_xlog is on a separate partition it'll add the only source of
fsyncs besides checkpoint to the main data directory.

I doubt it will make any difference in practice, at least on reasonable hardware (which you should have, if fsync performance matters to you).

But some performance testing will be necessary; I don't expect this to go in without that. It'd be helpful if you could describe the workload.

I've a bit of a hard time believing this'll be worthwhile.

The trouble is protections like this only seem worthwhile after the fact,
when something happens. I think it's reasonable protection against issues
similar to the one I reported ~2 weeks ago. YMMV.

Meh. That argument can be used to justify just about everything.

Obviously we should be more careful about fsyncing files, including
the directories. I do plan to come back to your recent patch.

My argument is that this is a reasonable protection against failures in that area, whether the fault is ours (in misunderstanding the durability guarantees of a particular file system) or the file system developers'.

Maybe it's not, because the chance of running into exactly the same issue in this part of code is negligible.


Additionally this doesn't seem to take WAL replay into account?

I think the comparison in StartupXLOG needs to be less strict, to allow
cases when we actually replay more WAL segments. Is that what you mean?
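
A rough sketch of what a less strict comparison could look like, again as a toy model in Python rather than the StartupXLOG code. It assumes the usual 24-hex-digit segment file names (8 digits of timeline followed by 16 digits of segment position) and, as a simplification of my own, ignores the timeline when comparing. The function names are hypothetical.

```python
def parse_segment_name(name):
    # Returns the (log, seg) position of a WAL segment file name, or
    # None for anything that is not a 24-hex-digit segment name
    # (e.g. .history or .backup files). The timeline is ignored here.
    if len(name) != 24:
        return None
    try:
        return int(name[8:16], 16), int(name[16:24], 16)
    except ValueError:
        return None


def at_or_beyond(recorded, names_on_disk):
    # Relaxed check: accept the cluster if any segment on disk is at
    # or past the recorded one, so replaying more WAL is not an error.
    target = parse_segment_name(recorded)
    if target is None:
        raise ValueError("recorded name is not a WAL segment: %r" % recorded)
    positions = (parse_segment_name(n) for n in names_on_disk)
    return any(p is not None and p >= target for p in positions)
```

With the recorded segment 000000010000000000000005, a directory holding only ...04 fails the check, while one holding ...05 or ...06 passes.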

What I mean is that the value isn't updated during recovery, afaics.
You could argue that minRecoveryPoint is that, in a way.

Oh, right. Will fix if we conclude that the general idea makes sense.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

