Fixing FSFS 'Corrupt node-revision' and 'Corrupt representation' errors

Julian Foad Wed, 06 Oct 2010 03:22:24 -0700

We found some corruption in a FSFS repository we were using at work.  I
have written a script (attached) to fix most but not all of it.



WHAT WERE THE SYMPTOMS?
-----------------------

The version of mod_dav_svn being used was 1.6.9.

A user got an error trying to commit one particular file, and also when
attempting to check out a fresh WC.  I don't have details of these.

Then 'svnadmin verify' was run on the repo, and revealed several corrupt
revisions, with the following three kinds of error:

  * svnadmin: Corrupt node-revision '5-12980.0.r12980/5571'
    svnadmin: Found malformed header in revision file

  * svnadmin: Corrupt representation '13001 1496 2082 16645 [...]'
    svnadmin: Malformed representation header

  * svnadmin: Reading one svndiff window read beyond the end of the
    representation

There were dozens of the first kind, a few of the second kind and one or
two of the third kind.

The corrupt revisions were spread over a period of a few weeks, with no
corrupt revisions before that or after that.  We know of nothing special
about that time period.


ANALYSIS
--------

I used both plain text searching and John Szakmeister's 'fsfsverify.py'
to help analyze the revision files.  Here are just the brief results of
what I found.

Most of the 'Corrupt node-revision' errors were due to the byte-offset
part of the node-rev id being wrong.  This error occurred with many
different node-rev ids.  A corrupt revision contained from one to
several ids with wrong byte-offsets.  Each particular node-rev id
appeared in several different revisions after the one in which it was
created, and it appeared correctly in some of them and wrongly in
others, with no discernible pattern.  Every time it appeared wrongly, it
had the same wrong value, so there were only two variants of each
node-rev id: the right one and the wrong one.  The byte-offset was
always fairly close to the correct value, but off by about 5 to 500
bytes.  The wrong byte-offset did not point to any special place in the
target revision file, such as the start or end of a data blob, so
svnadmin reported 'Found malformed header'.
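
To make the id structure concrete, here is a minimal Python sketch (not
part of the attached script) that splits a node-rev id into its parts
and reports how far a wrong byte-offset is from the right one.  The
helper and field names are illustrative only; the two ids are the
example pair from the usage text of find_good_id.py.

```python
from collections import namedtuple

# Illustrative field names for 'NODE-ID.COPY-ID.rREV/OFFSET'.
NodeRevId = namedtuple('NodeRevId', 'node_id copy_id rev offset')

def parse_node_rev_id(text):
    """Split 'NODE-ID.COPY-ID.rREV/OFFSET' into its four parts."""
    prefix, offset = text.rsplit('/', 1)
    node_and_copy, rev = prefix.rsplit('.r', 1)
    node_id, copy_id = node_and_copy.split('.', 1)
    return NodeRevId(node_id, copy_id, int(rev), int(offset))

def offset_error(bad, good):
    """Return (bad offset - good offset).  Refuse ids that differ in any
    part other than the byte-offset."""
    b, g = parse_node_rev_id(bad), parse_node_rev_id(good)
    if b[:3] != g[:3]:
        raise ValueError("ids differ in more than the byte-offset")
    return b.offset - g.offset

bad = '5-12302.1-12953.r12953/29475'
good = '5-12302.1-12953.r12953/29255'
print(parse_node_rev_id(bad))
print(offset_error(bad, good))   # 220
```

The difference of 220 bytes in this pair falls inside the 5 to 500 byte
range described above.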

One or two 'Corrupt node-revision' errors had a different cause.  A
directory entry reference to a subdirectory named 'X' (not its real
name) had the exact value 'dir 6-12953.0.r12953/30623'.  Exactly one of
the node-revs created in r12953 was named 'X', and it was a directory at
the right path, and its node-rev id was '0-12953.0.r12953/30403'.
Therefore I concluded that that is the correct replacement.  Note that
both the node-id component and the byte-offset part were wrong.

The 'Corrupt representation' errors were also due to a byte-offset being
wrong.  The second number, '1496' in the above example, is supposed to
be the byte-offset in the revision file.  Like the node-rev byte
offsets, these were typically off by a small amount.
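
As a rough illustration (not taken from the attached script), the
quoted string in a 'Corrupt representation' message can be split into
named fields like this.  The field order (revision, offset, size,
expanded size) follows the way the attached script reads these lines;
the tuple and function names are made up for this sketch, and the
elided '[...]' part is kept untouched.

```python
from collections import namedtuple

# Illustrative names; 'rest' holds whatever follows the four numbers.
RepRef = namedtuple('RepRef', 'rev offset size expanded_size rest')

def parse_rep_ref(text):
    """Parse 'REV OFFSET SIZE EXPANDED-SIZE ...' from the error message."""
    fields = text.split()
    rev, offset, size, expanded_size = (int(f) for f in fields[:4])
    return RepRef(rev, offset, size, expanded_size, fields[4:])

ref = parse_rep_ref('13001 1496 2082 16645 [...]')
print(ref.rev, ref.offset, ref.size)   # 13001 1496 2082
```

Only the 'offset' field was wrong in the errors analyzed here; the
revision and size were usable for finding the right representation.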

I did not investigate or fix the 'Reading one svndiff window ...' error.


THE SCRIPT TO FIX THE ERRORS
----------------------------

Usage:
  ./fix-repo REPO-DIR START-REVNUM

Files (attached, separately and as .tgz):
  fix-repo        # shell script, iterates over rev numbers; calls ...
  fixer/fix-rev.py        # finds and fixes errors, using ...
  fixer/find_good_id.py   # looks up a node-rev id, ignoring offset
  fixer/__init__.py       # empty file, defines this as a Python module

When the script sees a 'Corrupt node-revision' error message, it looks
up the node-rev id ignoring its offset part.  If found, it substitutes
the correct full id wherever it occurs in the revision file.  It expects
this change to result in a checksum error being reported next, and so it
substitutes the calculated checksum as reported in the error message.
(In fact, it assumes that any checksum error being reported should be
simply corrected in this way.)

For the second type of 'Corrupt node-revision' error, I could not find a
simple rule to determine when a node-rev id was wrong in this way so I
hard-coded that one specific substitution into the script.

When the script sees a 'Corrupt representation' error, it searches for
all representations in the target revision and, if exactly one of them
has the expected length, it substitutes the offset of this one.
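
The 'fix-repo' driver itself is a shell script.  The loop it performs,
as described in the file list above, might look roughly like the
following Python sketch.  This is an assumption-laden outline, not the
attached script: the function names are invented, the upper bound is
assumed to come from 'svnlook youngest', and carrying on after a
failed revision is a design choice of this sketch.

```python
import subprocess

# Path of the per-revision fixer, as named in the file list above.
FIX_REV = 'fixer/fix-rev.py'

def youngest_revision(repo_dir):
    """Ask 'svnlook youngest' for the newest revision number in REPO_DIR."""
    out = subprocess.check_output(['svnlook', 'youngest', repo_dir])
    return int(out.decode().strip())

def fix_repo(repo_dir, start, youngest, run=subprocess.call):
    """Run the fixer on revisions START..YOUNGEST inclusive.
    Return the list of revisions for which the fixer exited non-zero."""
    failed = []
    for rev in range(int(start), int(youngest) + 1):
        # Assumption: carry on after a failure so every bad revision is listed.
        if run([FIX_REV, repo_dir, str(rev)]) != 0:
            failed.append(rev)
    return failed
```

It would be called as fix_repo(REPO-DIR, START-REVNUM,
youngest_revision(REPO-DIR)).  The 'run' parameter exists only so the
loop can be exercised without touching a real repository.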


LIMITATIONS & IMPROVEMENTS
--------------------------

The script's algorithm is crude and could do with improvement in several
respects if it is to be used more widely.

It doesn't respect checksums.  When fixing a node-rev id, it should
update only the corresponding checksum rather than assuming that any
reported checksum error is the sole result of this fix.  When fixing a
representation offset, it should ensure the rep that it finds is in fact
the right one, probably by checking the checksum.

Detecting and fixing the second type of 'Corrupt node-revision' error
could probably be automated.

It doesn't replace a wrong byte-offset if the correct byte-offset has a
different number of digits.  I didn't encounter a need for this.  This
would be very difficult in the general case.  It might be possible to
cope with a length reduction by padding with leading zeros, or some
other trick.
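
A minimal sketch of the zero-padding idea, in Python.  It is untested
against a real repository: whether the FSFS reader accepts a
byte-offset with leading zeros is an assumption that would need
checking on a copy of the repository with 'svnadmin verify'.  The
function names and the example ids are invented for illustration.

```python
def pad_offset(good_offset, width):
    """Return GOOD_OFFSET as a decimal string left-padded with zeros to
    WIDTH digits.  Raise ValueError if it needs more than WIDTH digits."""
    digits = str(int(good_offset))
    if len(digits) > width:
        raise ValueError("offset %s does not fit in %d digits" % (digits, width))
    return digits.zfill(width)

def same_length_id(bad_id, good_id):
    """Return GOOD_ID rewritten to occupy exactly as many bytes as BAD_ID."""
    bad_prefix, bad_offset = bad_id.rsplit('/', 1)
    good_prefix, good_offset = good_id.rsplit('/', 1)
    if len(good_prefix) != len(bad_prefix):
        raise ValueError("ids differ in length before the byte-offset")
    return good_prefix + '/' + pad_offset(good_offset, len(bad_offset))

# Hypothetical ids, invented for illustration only.
print(same_length_id('0-5.0.r5/1003', '0-5.0.r5/987'))   # 0-5.0.r5/0987
```

A correct offset with more digits than the wrong one still cannot be
handled this way, which is why pad_offset() raises in that case.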

It uses simple text search and replace, whereas it should parse the
revision file to avoid the possibility of false matches of metadata
within user data sections.

The script is currently split into several short files and would be
better as a single script.  Or it could perhaps be incorporated into
'fsfsverify.py' or something else.


CONCLUSION
----------

I hope this analysis and script will be useful to other people.  I have
heard a few reports now of this kind of corruption, and we still have no
handle on how it happens.

Please let me know any thoughts or questions, the results of any use you
make of it, or anything I can do to help.


- Julian

fix-repo
Description: application/shellscript

fixer/find_good_id.py

#!/usr/bin/env python

usage = """
Print the correct FSFS node-rev id, given one that is correct except for
its byte-offset part.
Usage: $0 REPO-DIR FSFS-ID-WITH-BAD-OFFSET
   or: $0 REPO-DIR REV SIZE    (print the good offset of the rep of SIZE in REV)
Example:
  Result of running 'svnadmin verify':
    svnadmin: Corrupt node-revision '5-12302.1-12953.r12953/29475'
  Invocation of this script:
    $ $0 svn-repo 5-12302.1-12953.r12953/29475
  Output of this script:
    5-12302.1-12953.r12953/29255
"""

import os, sys

class FixError(Exception):
  """An exception for any kind of inability to repair the repository."""
  pass

def parse_id(id):
  """Return the (NODEREV, REV, OFFSET) of ID, where ID is of the form
     "NODEREV/OFFSET", and NODEREV is of the form "SOMETHING.rREV".
  """
  noderev, offset = id.split('/')
  _, rev = noderev.split('.r')
  return noderev, rev, offset

def rev_file_path(repo_dir, rev):
  return os.path.join(repo_dir, 'db', 'revs', rev)

def rev_file_indexes(repo_dir, rev):
  """Return (ids, texts), where IDS is a dictionary of all node-rev ids
     defined in revision REV of the repo at REPO_DIR, in the form
     {noderev: full id}, and TEXTS is an array of
     (offset, size, expanded-size, csum [,sha1-csum, uniquifier]) tuples
     taken from all the "text: REV ..." lines in revision REV."""
  ids = {}
  texts = []
  for line in open(rev_file_path(repo_dir, rev)):
    if line.startswith('id: '):
      id = line.replace('id: ', '').rstrip()
      id_noderev, id_rev, _ = parse_id(id)
      assert id_rev == rev
      ids[id_noderev] = id
    if line.startswith('text: ' + rev + ' '):  # also 'props:' lines?
      fields = line.split()
      texts.append(tuple(fields[2:]))
  return ids, texts

def find_good_id(repo_dir, bad_id):
  """Return the node-rev id that is like BAD_ID but has the byte-offset
     part corrected, by looking in the revision file in the repository
     at REPO_DIR.

     ### TODO: Parsing of the rev file should skip over node-content data
         when searching for a line matching "id: <id>", to avoid the
         possibility of a false match.
  """

  noderev, rev, bad_offset = parse_id(bad_id)
  ids, _ = rev_file_indexes(repo_dir, rev)

  if noderev not in ids:
    raise FixError("NodeRev Id '" + noderev + "' not found in r" + rev)
  return ids[noderev]

def find_good_rep_header(repo_dir, rev, size):
  """Find a rep header that matches REV and SIZE.
     Return the correct offset."""
  _, texts = rev_file_indexes(repo_dir, rev)
  n_matches = 0
  for fields in texts:
    if fields[1] == size:
      offset = fields[0]
      n_matches += 1
  if n_matches != 1:
    raise FixError("%d matches for r%s, size %s" % (n_matches, rev, size))
  return offset


if __name__ == '__main__':

  if len(sys.argv) == 4:
    repo_dir = sys.argv[1]
    rev = sys.argv[2]
    size = sys.argv[3]
    print "Good offset:", find_good_rep_header(repo_dir, rev, size)
    exit(0)

  if len(sys.argv) != 3:
    print >>sys.stderr, usage
    exit(1)

  repo_dir = sys.argv[1]
  bad_id = sys.argv[2]

  good_id = find_good_id(repo_dir, bad_id)

  # Replacement ID must be the same length, otherwise I don't know how to
  # reconstruct the file so as to preserve all offsets.
  # ### TODO: This check should be in the caller rather than here.
  if len(good_id) != len(bad_id):
    print >>sys.stderr, "warning: the good ID has a different length: " + \
                        "bad id '" + bad_id + "', good id '" + good_id + "'"

  print good_id

fixer/fix-rev.py

#!/usr/bin/env python

usage = """
Fix a bad FSFS revision file.
Usage: $0 REPO-DIR REVISION
"""

import os, sys, re, subprocess
from subprocess import Popen, PIPE

from find_good_id import FixError, rev_file_path, find_good_id, find_good_rep_header


# ----------------------------------------------------------------------
# Configuration

# Path and file name of the 'svnadmin' and 'svnlook' programs
SVNADMIN = 'svnadmin'
SVNLOOK = 'svnlook'

# Verbosity: True for verbose, or False for quiet
VERBOSE = True

# Global dictionaries recording the fixes made
fixed_ids = {}
fixed_checksums = {}


# ----------------------------------------------------------------------
# Functions

# Print a message, only if 'verbose' mode is enabled.
def verbose_print(str):
  if VERBOSE:
    print str

# Echo the arguments to a log file, and also (if verbose) to standard output.
def log(str):
  #print >>$REPO/fix-ids.log, str
  verbose_print(str)

def run_cmd_quiet(cmd, *args):
  retcode = subprocess.call([cmd] + list(args))
  return retcode

# Execute the command given by CMD and ARGS, and also log it.
def run_cmd(cmd, *args):
  log("CMD: " + cmd + ' ' + ' '.join(list(args)))
  return run_cmd_quiet(cmd, *args)

def replace_in_file(filename, old, new):
  """Replace the string OLD with the string NEW in file FILENAME.
     Replace all occurrences.  Raise an error if nothing changes."""

  verbose_print("Replacing '" + old + "' in file '" + filename + "'\n" +
                "    with  '" + new + "'")
  # Note: we can't use '/' as a delimiter in the substitution command.
  # \Q...\E makes Perl match OLD literally (an id contains '.', which would
  # otherwise match any character); the 'g' flag replaces every occurrence.
  run_cmd('perl', '-pi.bak', '-e',
          "s,\\Q" + old + "\\E," + new + ",g", filename)
  if run_cmd_quiet('cmp', '--quiet', filename, filename + '.bak') == 0:
    raise FixError("'" + filename + "' is unchanged after substitution.")
  os.remove(filename + '.bak')

def replace_in_rev_file(repo_dir, rev, old, new):
  rev_file = rev_file_path(repo_dir, rev)
  replace_in_file(rev_file, old, new)

# Fix a node-rev ID that has a bad byte-offset part.  Look up the correct
# byte-offset by using the rest of the ID, which necessarily points into an
# older revision or the same revision.  Fix all occurrences within REV_FILE.
#
# ### TODO: Fix occurrences in revisions between <ID revision> and <REV>,
#   since the error reported for <REV> might actually exist in an older
#   revision that is referenced by <REV>.
#
def fix_id(repo_dir, rev, bad_id):

  # Find the GOOD_ID to replace BAD_ID.
  if bad_id == "6-12953.0.r12953/30623":
    good_id = "0-12953.0.r12953/30403"
  else:
    good_id = find_good_id(repo_dir, bad_id)

  # Replacement ID must be the same length, otherwise I don't know how to
  # reconstruct the file so as to preserve all offsets.
  if len(good_id) != len(bad_id):
    raise FixError("Can't handle a replacement ID with a different length: " +
                   "bad id '" + bad_id + "', good id '" + good_id + "'")

  if good_id == bad_id:
    raise FixError("The ID supplied is already correct: " +
                   "good id '" + good_id + "'")

  print "Fixing id: " + bad_id + " -> " + good_id
  replace_in_rev_file(repo_dir, rev, bad_id, good_id)
  fixed_ids[bad_id] = good_id

def fix_checksum(repo_dir, rev, old_checksum, new_checksum):
  """Change all occurrences of OLD_CHECKSUM to NEW_CHECKSUM in the revision
     file for REV in REPO_DIR."""

  assert len(old_checksum) and len(new_checksum)
  assert old_checksum != new_checksum

  print "Fixing checksum: " + old_checksum + " -> " + new_checksum
  replace_in_rev_file(repo_dir, rev, old_checksum, new_checksum)
  fixed_checksums[old_checksum] = new_checksum

def fix_delta_ref(repo_dir, rev, bad_rev, bad_offset, bad_size):
  """Fix a "DELTA <REV> <OFFSET> <SIZE>" line in the revision file for REV
     in REPO_DIR, where <OFFSET> is wrong."""
  good_offset = find_good_rep_header(repo_dir, bad_rev, bad_size)
  old_line = ' '.join(['DELTA', bad_rev, bad_offset, bad_size])
  new_line = ' '.join(['DELTA', bad_rev, good_offset, bad_size])
  print "Fixing delta ref:", old_line, "->", new_line
  replace_in_rev_file(repo_dir, rev, old_line, new_line)


def handle_one_error(repo_dir, rev, error_lines):
  """If ERROR_LINES describes an error we know how to fix, then fix it.
     Return True if fixed, False if not fixed."""

  line1 = error_lines[0]
  match = re.match(r"svn.*: Corrupt node-revision '(.*)'", line1)
  if match:
    # Fix it.
    bad_id = match.group(1)
    verbose_print(error_lines[0])
    fix_id(repo_dir, rev, bad_id)

    # Verify again, and expect to discover a checksum mismatch.
    # verbose_print("Fixed an ID; now verifying to discover the checksum we need to update")
    # error_lines = ...
    # if error_lines[0] != "svn.*: Checksum mismatch while reading representation:":
    #   raise FixError("expected a checksum mismatch after replacing the Id;" +
    #                  "  instead, got this output from 'svnadmin verify -q':" +
    #                  "//".join(error_lines))
    #
    # expected = ...
    # actual   = ...
    # fix_checksum(repo_dir, rev, expected, actual)

    return True

  match = re.match(r"svn.*: Checksum mismatch while reading representation:", line1)
  if match:
    verbose_print(error_lines[0])
    verbose_print(error_lines[1])
    verbose_print(error_lines[2])
    expected = re.match(r' *expected: *([^ ]*)', error_lines[1]).group(1)
    actual   = re.match(r' *actual: *([^ ]*)',   error_lines[2]).group(1)
    fix_checksum(repo_dir, rev, expected, actual)
    return True

  match = re.match(r"svn.*: Corrupt representation '([0-9]*) ([0-9]*) ([0-9]*) .*'", line1)
  if match:
    # Extract the bad reference. We expect only 'offset' is actually bad, in
    # the known kind of corruption that we're targeting.
    bad_rev = match.group(1)
    bad_offset = match.group(2)
    bad_size = match.group(3)
    fix_delta_ref(repo_dir, rev, bad_rev, bad_offset, bad_size)
    return True

  return False

def fix_one_error(repo_dir, rev):
  """Verify, and if there is an error we know how to fix, then fix it.
     Return False if no error, True if fixed, exception if can't fix."""

  # Capture the output of 'svnadmin verify' (ignoring any debug-build output)
  p = Popen([SVNADMIN, 'verify', '-q', '-r'+rev, repo_dir], stdout=PIPE, stderr=PIPE)
  _, stderr = p.communicate()
  svnadmin_err = []
  for line in stderr.splitlines():
    if line.find('(apr_err=') == -1:
      svnadmin_err.append(line)

  if svnadmin_err == []:
    return False

  try:
    if handle_one_error(repo_dir, rev, svnadmin_err):
      return True
  except FixError, e:
    print 'warning:', e
    print "Trying 'svnlook' instead."
    pass

  # At this point, we've got an 'svnadmin' error that we don't know how to
  # handle.  Before giving up, see if 'svnlook' gives a different error,
  # one that we *can* handle.

  # Capture the output of 'svnlook tree' (ignoring any debug-build output)
  p = Popen([SVNLOOK, 'tree', '-r'+rev, repo_dir], stdout=PIPE, stderr=PIPE)
  _, stderr = p.communicate()
  svnlook_err = []
  for line in stderr.splitlines():
    if line.find('(apr_err=') == -1:
      svnlook_err.append(line)

  if svnlook_err == []:
    print 'warning: svnlook did not find an error'
  else:
    if handle_one_error(repo_dir, rev, svnlook_err):
      return True

  raise FixError("unfixable error:\n  " + "\n  ".join(svnadmin_err))


# ----------------------------------------------------------------------
# Main program

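def fix_rev(repo_dir, rev):
  """Repeatedly verify revision REV of the repo at REPO_DIR, fixing one
     error at a time until it verifies cleanly.  Raise FixError on an
     error that can't be fixed."""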

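  # Back up the file.
  # ### TODO: Not implemented; no '.orig' copy is made, so back up the
  #     repository before running this script.
  if not os.path.exists(rev_file_path(repo_dir, rev) + '.orig'):
    pass
    # cp -a "$FILE" "$FILE.orig"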

  # Keep looking for verification errors in r$REV and fixing them while we can.
  while fix_one_error(repo_dir, rev):
    pass
  print "Revision " + rev + " verifies OK."


if __name__ == '__main__':

  if len(sys.argv) != 3:
    print >>sys.stderr, usage
    exit(1)

  repo_dir = sys.argv[1]
  rev = sys.argv[2]

  try:
    fix_rev(repo_dir, rev)
  except FixError, e:
    print 'error:', e
    exit(1)

fix-fsfs-corruption-1.tgz
Description: application/compressed-tar
