On 01/23/2016 02:35 AM, Michael Paquier wrote:
On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <st...@mit.edu> wrote:
On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM on VMware Fusion, and on
ext4 with data=ordered the renames get lost if the root
folder is not fsync'ed. By killing the VM with kill -9 I am able to
reproduce that really easily.


Yep. Same experience here (with qemu-kvm VMs).
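
For reference, the failure above comes down to a rename() whose directory entry never reaches disk. A minimal sketch of a durable rename follows, assuming Linux and that source and target live in the same directory. The helper name durable_rename is made up for illustration and is not taken from the PostgreSQL source.

```python
import os


def durable_rename(src, dst):
    """Rename src to dst and fsync the containing directory.

    Sketch only: assumes Linux/POSIX and that src and dst share one directory.
    """
    # Flush the file contents first, so the new name never points at
    # data that has not been written out.
    fd = os.open(src, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    os.rename(src, dst)

    # The rename is a change to the directory, so the directory itself
    # has to be fsync'ed for the new entry to survive a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the second fsync is what leaves the window in which killing the VM loses the rename.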

I still think a better approach for this is to run the database on an
LVM volume and take lots of snapshots. No VM needed, though it doesn't
hurt. LVM volumes are below the level of the filesystem and a snapshot
captures the state of the raw blocks the filesystem has written to the
block layer. The block layer does no caching, though the drive may, but
neither the VM solution nor LVM would capture that.

LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be an unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.
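
A rough sketch of how such a snapshot loop could be driven, assuming the data directory sits on a logical volume and the script runs as root. The volume group, volume name, snapshot prefix and size are placeholders, not values from this thread.

```python
import subprocess
import time


def snapshot_command(vg, lv, name, size="1G"):
    """Build the lvcreate invocation for one copy-on-write snapshot."""
    # --size is the space reserved for blocks changed after the snapshot.
    return ["lvcreate", "--snapshot", "--size", size,
            "--name", name, f"{vg}/{lv}"]


def take_snapshots(vg, lv, count, interval_s, size="1G"):
    """Take `count` snapshots, `interval_s` seconds apart, while the DB runs."""
    for i in range(count):
        # Snapshot names here are hypothetical: snap_0000, snap_0001, ...
        subprocess.run(snapshot_command(vg, lv, f"snap_{i:04d}", size),
                       check=True)
        time.sleep(interval_s)
```

For example, take_snapshots("vg0", "pgdata", 100, 0.5) would leave a hundred snapshots that can each be mounted and inspected afterwards.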

Another idea: hardcode a PANIC just after rename() with
restart_after_crash = off (this needs IsBootstrapProcess() checks).
Once the server crashes, kill -9 the VM. Then restart the VM and the
Postgres instance with a new binary that does not have the PANIC, and
see how things turn out. There is a window of up to several
seconds after the rename() call, so I guess that this would work.

I don't see how that would improve anything, as the PANIC has no impact on the I/O requests already issued to the system. What you need is some sort of coordination between the database and the script that kills the VM (or takes an LVM snapshot).

That can be done by simply emitting a particular log message, and the "kill script" can simply watch the log file (for example over SSH). This has the benefit that you can also watch for additional conditions that are difficult to check from that particular part of the code, and only kill the VM when all of them are met (for example, only on the third checkpoint since start).
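
A minimal sketch of such a kill script, assuming the server log is reachable with "ssh ... tail -F" and that the VM is a libvirt domain, so that "virsh destroy" acts as the hard power-off. The host, log path, marker text and domain name are all placeholders.

```python
import subprocess


def wait_for_nth(lines, marker, nth):
    """Return True once `marker` has been seen in `nth` lines, False at EOF."""
    seen = 0
    for line in lines:
        if marker in line:
            seen += 1
            if seen >= nth:
                return True
    return False


def kill_vm_on_marker(ssh_host, log_path, marker, nth, domain):
    """Follow the server log over SSH and hard-stop the VM on the nth match."""
    tail = subprocess.Popen(["ssh", ssh_host, "tail", "-F", log_path],
                            stdout=subprocess.PIPE, text=True)
    try:
        if wait_for_nth(tail.stdout, marker, nth):
            # "virsh destroy" stops the guest immediately, without a
            # clean shutdown, which is the point of the test.
            subprocess.run(["virsh", "destroy", domain], check=True)
    finally:
        tail.kill()
```

Keeping the counting logic in wait_for_nth makes it easy to add further conditions before the kill is issued.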

The reason why I was not particularly thrilled about the LVM snapshot idea is that to identify this particular data loss issue, you need to be able to reason about the expected state of the database (which transactions are committed, how many segments there are). And my understanding was that Greg's idea was merely "try to start the DB on a snapshot and see if it starts / is not corrupted," which would not work with this particular issue, as the database seemed just fine - the data loss is silent. Adding the "last XLOG segment" into pg_controldata would make it easier to detect without having to track details about which transactions got committed.
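
To illustrate what "reasoning about the expected state" could look like on the segment side, here is a small sketch. It assumes the test harness records the WAL segment file names it expects before killing the VM and compares them with a directory listing after restart. The function name and the recording step are hypothetical.

```python
import re

# WAL segment file names are 24 uppercase hexadecimal characters.
SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def missing_segments(expected, present):
    """Return the expected segment names that are absent from `present`.

    `present` is a plain directory listing; entries that do not look
    like segment files are ignored.
    """
    present_segments = {name for name in present if SEGMENT_RE.match(name)}
    return sorted(set(expected) - present_segments)
```

A call such as missing_segments(recorded_before_kill, os.listdir("pg_xlog")) returning a non-empty list would flag the silent loss even though the server starts cleanly.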

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)