We found some corruption in a FSFS repository we were using at work. I have written a script (attached) to fix most but not all of it.
WHAT WERE THE SYMPTOMS? ----------------------- The version of mod_dav_svn being used was 1.6.9. A user got an error trying to commit one particular file, and also when attempting to check out a fresh WC. I don't have details of these. Then 'svnadmin verify' was run on the repo, and revealed several corrupt revisions, with the following three kinds of error: * svnadmin: Corrupt node-revision '5-12980.0.r12980/5571' svnadmin: Found malformed header in revision file * svnadmin: Corrupt representation '13001 1496 2082 16645 [...]' svnadmin: Malformed representation header * svnadmin: Reading one svndiff window read beyond the end of the representation There were dozens of the first kind, a few of the second kind and one or two of the third kind. The corrupt revisions were spread over a period of a few weeks, with no corrupt revisions before that or after that. We know of nothing special about that time period. ANALYSIS -------- I used both plain text searching and John Szakmeister's 'fsfsverify.py' to help analyze the revision files. Here are just the brief results of what I found. Most of the 'Corrupt node-revision' errors were due to the byte-offset part of the node-rev id being wrong. This error occurred with many different node-rev ids. A corrupt revision contained from one to several ids with wrong byte-offsets. Each particular node-rev id appeared in several different revisions after the one in which it was created, and it appeared correctly in some of them and wrongly in others, with no discernable pattern. Every time it appeared wrongly, it had the same wrong value, so there were only two variants of each node-rev id: the right one and the wrong one. The byte-offset was always fairly close to the correct value, but off by about 5 to 500 bytes. The wrong byte-offset did not point to any special place in the target revision file, such as the start or end of a data blob, so svnadmin reported 'Found malformed header'. One or two 'Corrupt node-revision' errors were wrong in another way. A directory entry reference to a subdirectory named 'X' (not its real name) had the exact value 'dir 6-12953.0.r12953/30623'. Exactly one of the node-revs created in r12953 was named 'X', and it was a directory at the right path, and its node-rev id was '0-12953.0.r12953/30403'. Therefore I concluded that that is the correct replacement. Note that both the node-id component and the byte-offset part were wrong. The 'Corrupt representation' errors were also due to a byte-offset being wrong. The second number, '1496' in the above example, is supposed to be the byte-offset in the revision file. Like the node-rev byte offsets, these were typically off by a small amount. I did not investigate or fix the 'Reading one svndiff window ...' error. THE SCRIPT TO FIX THE ERRORS ---------------------------- Usage: ./fix-repo REPO-DIR START-REVNUM Files (attached, separately and as .tgz): fix-repo # shell script, iterates over rev numbers; calls ... fixer/fix-rev.py # finds and fixes errors, using ... fixer/find_good_id.py # looks up a node-rev id, ignoring offset fixer/__init__.py # empty file, defines this as a Python module When the script sees a 'Corrupt node-revision' error message, it looks up the node-rev id ignoring its offset part. If found, it substitutes the correct full id wherever it occurs in the revision file. It expects this change to result in a checksum error being reported next, and so it substitutes the calculated checksum as reported in the error message. (In fact, it assumes that any checksum error being reported should be simply corrected in this way.) For the second type of 'Corrupt node-revision' error, I could not find a simple rule to determine when a node-rev id was wrong in this way so I hard-coded that one specific substitution into the script. When the script sees a 'Corrupt representation' error, it searches for all representations in the target revision and, if exactly one of them has the expected length, it substitutes the offset of this one. LIMITATIONS & IMPROVEMENTS -------------------------- The script's algorithm is crude and could do with improvement in several respects if it is to be used more widely. It doesn't respect checksums. When fixing a node-rev id, it should update only the corresponding checksum rather than assuming that any reported checksum error is the sole result of this fix. When fixing a representation offset, it should ensure the rep that it finds is in fact the right one, probably by checking the checksum. Detecting and fixing the second type of 'Corrupt node-revision' error could probably be automated. It doesn't replace a wrong byte-offset if the correct byte-offset has a different number of digits. I didn't encounter a need for this. This would be very difficult in the general case. It might be possible to cope with a length reduction by padding with leading zeros, or some other trick. It uses simple text search and replace, whereas it should parse the revision file to avoid the possibility of false matches of metadata within user data sections. The script is currently split into several short files and would be better as a single script. Or it could perhaps be incorporated into 'fsfsverify.py' or something else. CONCLUSION ---------- I hope this analysis and script will be useful to other people. I have heard a few reports now of this kind of corruption, and we still have no handle on how it happens. Please let me know any thoughts or questions, the results of any use you make of it, or anything I can do to help. - Julian
fix-repo
Description: application/shellscript
#!/usr/bin/env python usage = """ Print the correct FSFS node-rev id, given one that is correct except for its byte-offset part. Usage: $0 REPO-DIR FSFS-ID-WITH-BAD-OFFSET Example: Result of running 'svnadmin verify': svnadmin: Corrupt node-revision '5-12302.1-12953.r12953/29475' Invocation of this script: $ $0 svn-repo 5-12302.1-12953.r12953/29475 Output of this script: 5-12302.1-12953.r12953/29255 """ import os, sys class FixError(Exception): """An exception for any kind of inablility to repair the repository.""" pass def parse_id(id): """Return the (NODEREV, REV, OFFSET) of ID, where ID is of the form "NODEREV/OFFSET", and NODEREV is of the form "SOMETHING.rREV". """ noderev, offset = id.split('/') _, rev = noderev.split('.r') return noderev, rev, offset def rev_file_path(repo_dir, rev): return os.path.join(repo_dir, 'db', 'revs', rev) def rev_file_indexes(repo_dir, rev): """Return (ids, texts), where IDS is a dictionary of all node-rev ids defined in revision REV of the repo at REPO_DIR, in the form {noderev: full id}, and TEXTS is an array of (offset, size, expanded-size, csum [,sha1-csum, uniquifier]) tuples taken from all the "text: REV ..." lines in revision REV.""" ids = {} texts = [] for line in open(rev_file_path(repo_dir, rev)): if line.startswith('id: '): id = line.replace('id: ', '').rstrip() id_noderev, id_rev, _ = parse_id(id) assert id_rev == rev ids[id_noderev] = id if line.startswith('text: ' + rev + ' '): # also 'props:' lines? fields = line.split() texts.append(tuple(fields[2:])) return ids, texts def find_good_id(repo_dir, bad_id): """Return the node-rev id that is like BAD_ID but has the byte-offset part corrected, by looking in the revision file in the repository at REPO_DIR. ### TODO: Parsing of the rev file should skip over node-content data when searching for a line matching "id: <id>", to avoid the possibility of a false match. """ noderev, rev, bad_offset = parse_id(bad_id) ids, _ = rev_file_indexes(repo_dir, rev) if noderev not in ids: raise FixError("NodeRev Id '" + noderev + "' not found in r" + rev) return ids[noderev] def find_good_rep_header(repo_dir, rev, size): """Find a rep header that matches REV and SIZE. Return the correct offset.""" _, texts = rev_file_indexes(repo_dir, rev) n_matches = 0 for fields in texts: if fields[1] == size: offset = fields[0] n_matches += 1 if n_matches != 1: raise FixError("%d matches for r%s, size %s" % (n_matches, rev, size)) return offset if __name__ == '__main__': if len(sys.argv) == 4: repo_dir = sys.argv[1] rev = sys.argv[2] size = sys.argv[3] print "Good offset:", find_good_rep_header(repo_dir, rev, size) exit(0) if len(sys.argv) != 3: print >>sys.stderr, usage exit(1) repo_dir = sys.argv[1] bad_id = sys.argv[2] good_id = find_good_id(repo_dir, bad_id) # Replacement ID must be the same length, otherwise I don't know how to # reconstruct the file so as to preserve all offsets. # ### TODO: This check should be in the caller rather than here. if len(good_id) != len(bad_id): print >>sys.stderr, "warning: the good ID has a different length: " + \ "bad id '" + bad_id + "', good id '" + good_id + "'" print good_id
#!/usr/bin/env python usage = """ Fix a bad FSFS revision file. Usage: $0 REPO-DIR REVISION """ import os, sys, re, subprocess from subprocess import Popen, PIPE from find_good_id import FixError, rev_file_path, find_good_id, find_good_rep_header # ---------------------------------------------------------------------- # Configuration # Path and file name of the 'svnadmin' and 'svnlook' programs SVNADMIN = 'svnadmin' SVNLOOK = 'svnlook' # Verbosity: True for verbose, or False for quiet VERBOSE = True # Global dictionaries recording the fixes made fixed_ids = {} fixed_checksums = {} # ---------------------------------------------------------------------- # Functions # Print a message, only if 'verbose' mode is enabled. def verbose_print(str): if VERBOSE: print str # Echo the arguments to a log file, and also (if verbose) to standard output. def log(str): #print >>$REPO/fix-ids.log, str verbose_print(str) def run_cmd_quiet(cmd, *args): retcode = subprocess.call([cmd] + list(args)) return retcode # Execute the command given by CMD and ARGS, and also log it. def run_cmd(cmd, *args): log("CMD: " + cmd + ' ' + ' '.join(list(args))) return run_cmd_quiet(cmd, *args) def replace_in_file(filename, old, new): """Replace the string OLD with the string NEW in file FILE. Replace all occurrences. Raise an error if nothing changes.""" verbose_print("Replacing '" + old + "' in file '" + filename + "'\n" + " with '" + new + "'") # Note: we can't use '/' as a delimiter in the substitution command. run_cmd('perl', '-pi.bak', '-e', "s," + old + "," + new + ",", filename) if run_cmd_quiet('cmp', '--quiet', filename, filename + '.bak') == 0: raise FixError("'" + filename + "' is unchanged after sed substitution.") os.remove(filename + '.bak') def replace_in_rev_file(repo_dir, rev, old, new): rev_file = rev_file_path(repo_dir, rev) replace_in_file(rev_file, old, new) # Fix a node-rev ID that has a bad byte-offset part. Look up the correct # byte-offset by using the rest of the ID, which necessarily points into an # older revision or the same revision. Fix all occurrences within REV_FILE. # # ### TODO: Fix occurrences in revisions between <ID revision> and <REV>, # since the error reported for <REV> might actually exist in an older # revision that is referenced by <REV>. # def fix_id(repo_dir, rev, bad_id): # Find the GOOD_ID to replace BAD_ID. if bad_id == "6-12953.0.r12953/30623": good_id = "0-12953.0.r12953/30403" else: good_id = find_good_id(repo_dir, bad_id) # Replacement ID must be the same length, otherwise I don't know how to # reconstruct the file so as to preserve all offsets. if len(good_id) != len(bad_id): raise FixError("Can't handle a replacement ID with a different length: " + "bad id '" + bad_id + "', good id '" + good_id + "'") if good_id == bad_id: raise FixError("The ID supplied is already correct: " + "good id '" + good_id + "'") print "Fixing id: " + bad_id + " -> " + good_id replace_in_rev_file(repo_dir, rev, bad_id, good_id) fixed_ids[bad_id] = good_id def fix_checksum(repo_dir, rev, old_checksum, new_checksum): """Change all occurrences of OLD_CHECKSUM to NEW_CHECKSUM in the revision file for REV in REPO_DIR.""" assert len(old_checksum) and len(new_checksum) assert old_checksum != new_checksum print "Fixing checksum: " + old_checksum + " -> " + new_checksum replace_in_rev_file(repo_dir, rev, old_checksum, new_checksum) fixed_checksums[old_checksum] = new_checksum def fix_delta_ref(repo_dir, rev, bad_rev, bad_offset, bad_size): """Fix a "DELTA <REV> <OFFSET> <SIZE>" line in the revision file for REV in REPO_DIR, where <OFFSET> is wrong.""" good_offset = find_good_rep_header(repo_dir, bad_rev, bad_size) old_line = ' '.join(['DELTA', bad_rev, bad_offset, bad_size]) new_line = ' '.join(['DELTA', bad_rev, good_offset, bad_size]) print "Fixing delta ref:", old_line, "->", new_line replace_in_rev_file(repo_dir, rev, old_line, new_line) def handle_one_error(repo_dir, rev, error_lines): """If ERROR_LINES describes an error we know how to fix, then fix it. Return True if fixed, False if not fixed.""" line1 = error_lines[0] match = re.match(r"svn.*: Corrupt node-revision '(.*)'", line1) if match: # Fix it. bad_id = match.group(1) verbose_print(error_lines[0]) fix_id(repo_dir, rev, bad_id) # Verify again, and expect to discover a checksum mismatch. # verbose_print("Fixed an ID; now verifying to discover the checksum we need to update") # error_lines = ... # if error_lines[0] != "svn.*: Checksum mismatch while reading representation:": # raise FixError("expected a checksum mismatch after replacing the Id;" + # " instead, got this output from 'svnadmin verify -q':" + # "//".join(error_lines)) # # expected = ... # actual = ... # fix_checksum(repo_dir, rev, expected, actual) return True match = re.match(r"svn.*: Checksum mismatch while reading representation:", line1) if match: verbose_print(error_lines[0]) verbose_print(error_lines[1]) verbose_print(error_lines[2]) expected = re.match(r' *expected: *([^ ]*)', error_lines[1]).group(1) actual = re.match(r' *actual: *([^ ]*)', error_lines[2]).group(1) fix_checksum(repo_dir, rev, expected, actual) return True match = re.match(r"svn.*: Corrupt representation '([0-9]*) ([0-9]*) ([0-9]*) .*'", line1) if match: # Extract the bad reference. We expect only 'offset' is actually bad, in # the known kind of corruption that we're targetting. bad_rev = match.group(1) bad_offset = match.group(2) bad_size = match.group(3) fix_delta_ref(repo_dir, rev, bad_rev, bad_offset, bad_size) return True return False def fix_one_error(repo_dir, rev): """Verify, and if there is an error we know how to fix, then fix it. Return False if no error, True if fixed, exception if can't fix.""" # Capture the output of 'svnadmin verify' (ignoring any debug-build output) p = Popen([SVNADMIN, 'verify', '-q', '-r'+rev, repo_dir], stdout=PIPE, stderr=PIPE) _, stderr = p.communicate() svnadmin_err = [] for line in stderr.splitlines(): if line.find('(apr_err=') == -1: svnadmin_err.append(line) if svnadmin_err == []: return False try: if handle_one_error(repo_dir, rev, svnadmin_err): return True except FixError, e: print 'warning:', e print "Trying 'svnlook' instead." pass # At this point, we've got an 'svnadmin' error that we don't know how to # handle. Before giving up, see if 'svnlook' gives a different error, # one that we *can* handle. # Capture the output of 'svnlook tree' (ignoring any debug-build output) p = Popen([SVNLOOK, 'tree', '-r'+rev, repo_dir], stdout=PIPE, stderr=PIPE) _, stderr = p.communicate() svnlook_err = [] for line in stderr.splitlines(): if line.find('(apr_err=') == -1: svnlook_err.append(line) if svnlook_err == []: print 'warning: svnlook did not find an error' else: if handle_one_error(repo_dir, rev, svnlook_err): return True raise FixError("unfixable error:\n " + "\n ".join(svnadmin_err)) # ---------------------------------------------------------------------- # Main program def fix_rev(repo_dir, rev): """""" # Back up the file if not os.path.exists(rev_file_path(repo_dir, rev) + '.orig'): pass # cp -a "$FILE" "$FILE.orig" # Keep looking for verification errors in r$REV and fixing them while we can. while fix_one_error(repo_dir, rev): pass print "Revision " + rev + " verifies OK." if __name__ == '__main__': if len(sys.argv) != 3: print >>sys.stderr, usage exit(1) repo_dir = sys.argv[1] rev = sys.argv[2] try: fix_rev(repo_dir, rev) except FixError, e: print 'error:', e exit(1)
fix-fsfs-corruption-1.tgz
Description: application/compressed-tar