Re: repo consistency under crashes and power failures?

2013-07-26 Thread Jeff King
On Mon, Jul 15, 2013 at 01:48:23PM -0400, Greg Troxel wrote:

> I am curious if anyone has actual experiences to share, either
>
>   a report of corruption after a crash (where corruption means that
>   either 1) git fsck reports worse than dangling objects or 2) some ref
>   did not either point to the old place or the new place)
>
>   experiments intended to provoke corruption, like dropping power during
>   pushes, or forced panics in the kernel due to timers, etc.

I have quite a bit of experience with this, as I investigate all repo
corruption that we see on github.com, and have run experiments to try to
reproduce such corruption.

Our backend git systems are ext3 with journaling and data=ordered. We
run that on top of drbd, with two redundant machines sharing the block
device. If one dies, we fail over to the spare. Writes to the block
device are not considered committed until they are written to both
machines.

Git's scheme is to write objects (both loose and when receiving packs
over the wire) via tempfile, with an atomic link-into-place after close.
We do not fsync object files by default, but we do fsync packs. However,
it shouldn't matter as long as your filesystem orders data and metadata
writes (if it doesn't, you probably want to turn on object fsyncing).
So for our data=ordered filesystems, that's fine.
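
To make the object-writing scheme above concrete, here is a minimal
sketch in Python (not git's actual C code). The helper name
write_loose_object, the tmp_obj_ prefix, and the fsync parameter are all
made up for illustration. It writes the complete contents to a tempfile,
optionally fsyncs, and only then hard-links the file to its final name.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, payload, obj_type="blob", fsync=False):
    """Store payload as a loose object and return its hex object id.

    Hypothetical helper: mirrors the tempfile-then-link idea only.
    """
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    store = header + payload
    oid = hashlib.sha1(store).hexdigest()
    fanout = os.path.join(objects_dir, oid[:2])
    final = os.path.join(fanout, oid[2:])
    os.makedirs(fanout, exist_ok=True)

    # Write the complete, compressed contents under a temporary name.
    fd, tmp = tempfile.mkstemp(prefix="tmp_obj_", dir=objects_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(zlib.compress(store))
            if fsync:
                # Roughly what an "fsync object files" knob would add.
                f.flush()
                os.fsync(f.fileno())
        try:
            # Link into place only after the file is closed.
            os.link(tmp, final)
        except FileExistsError:
            # Same id means same contents, so the existing file is fine.
            pass
    finally:
        # The temporary name is no longer needed either way.
        os.unlink(tmp)
    return oid
```

With fsync=False this relies on the filesystem ordering data before
metadata, as discussed above; with fsync=True the contents are flushed
before the final name ever appears.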

Ref writes have a similar fsync situation to loose object files. We
write the new ref to a tempfile, close, and then rename into place. If
the data and metadata writes are out of order, one could have problems
(but again, not a problem with data=ordered).
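
The ref update can be sketched the same way. Again this is an
illustration in Python under assumptions, not git's real code: update_ref,
the tmp_ref_ prefix, and the fsync parameter are hypothetical names, and
the directory fsync assumes a POSIX system.

```python
import os
import tempfile


def update_ref(git_dir, refname, new_oid, fsync=False):
    """Point refname at new_oid via a tempfile and an atomic rename.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(git_dir, refname)
    ref_dir = os.path.dirname(path)
    os.makedirs(ref_dir, exist_ok=True)

    # Create the tempfile in the same directory so the rename stays
    # within one filesystem.
    fd, tmp = tempfile.mkstemp(prefix="tmp_ref_", dir=ref_dir)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_oid + "\n")
            if fsync:
                # Flush the data before the rename makes it visible.
                f.flush()
                os.fsync(f.fileno())
        # Readers see either the old ref contents or the new ones.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

    if fsync:
        # Also flush the directory entry (POSIX only).
        dfd = os.open(ref_dir, os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)
```

Without the fsync calls, the rename could reach disk before the file
data on a filesystem that does not order the two, which is the failure
mode described above.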

Most of the corruption we have seen at GitHub has been one of:

  1. Buggy non-core-git implementations that do not properly use
     tempfiles to create objects (Grit used to have this problem, but it
     is now fixed).

  2. Race conditions in examining ref state that can cause refs to be
     missed when determining reachability (thus you might prune objects
     that should be left). The worst of these is fixed in the current
     master and will be part of git v1.8.4. There are still ways that
     we can prune too much, but they are reasonably unlikely unless you
     are pruning constantly.

We did once experience some lost objects after a server failover. After
much experimentation, we finally found out that the machine in question
had a RAID card with bad memory. After a power failure, it would drop
some writes that it had already claimed to have committed (so even fsync
did not help).

So for ordered data and metadata writes, in my experience git is quite
solid against power failures and crashes. For systems without that
guarantee, you should turn on core.fsyncobjectfiles, but I suspect you
could still see some ref corruption (and possibly index corruption,
too), as neither refs nor the index are fsynced.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: repo consistency under crashes and power failures?

2013-07-16 Thread Johannes Sixt
On 7/15/2013 19:48, Greg Troxel wrote:
> Clearly there is the possibility of creating a corrupt repository when
> receiving objects and updating refs, if a crash or power failure causes
> data not to get written to disk but that data is pointed to.  Journaling
> mitigates this, but I'd argue that programs should function safely with
> only the guarantees from POSIX.

Even under POSIX, guarantees and crash/power failure do not mesh well.
This has been under dispute recently, for example:

http://thread.gmane.org/gmane.comp.standards.posix.austin.general/7456/focus=7487

The best we can achieve with POSIX alone is to make bad consequences less
likely.

Jonathan already mentioned the knob that allows you to trade performance
for more safety.

-- Hannes


repo consistency under crashes and power failures?

2013-07-15 Thread Greg Troxel

Clearly there is the possibility of creating a corrupt repository when
receiving objects and updating refs, if a crash or power failure causes
data not to get written to disk but that data is pointed to.  Journaling
mitigates this, but I'd argue that programs should function safely with
only the guarantees from POSIX.

I am curious if anyone has actual experiences to share, either

  a report of corruption after a crash (where corruption means that
  either 1) git fsck reports worse than dangling objects or 2) some ref
  did not either point to the old place or the new place)

  experiments intended to provoke corruption, like dropping power during
  pushes, or forced panics in the kernel due to timers, etc.

Alternatively, is there somewhere a first-principles analysis vs POSIX
specs (such as fsyncing object files before updating refs to point to
them, which I realize has performance negatives)?

(I have not done experiments, but have observed no corruption.)

Thanks,
Greg




Re: repo consistency under crashes and power failures?

2013-07-15 Thread Jonathan Nieder
Greg Troxel wrote:

> Alternatively, is there somewhere a first-principles analysis vs POSIX
> specs (such as fsyncing object files before updating refs to point to
> them, which I realize has performance negatives)?

You might be interested in the 'core.fsyncobjectfiles' setting.
git-config(1) has details.

Thanks and hope that helps,
Jonathan