Re: Fixing FSFS 'Corrupt node-revision' and 'Corrupt representation' errors

2010-10-12 Thread John Szakmeister
On Wed, Oct 6, 2010 at 6:21 AM, Julian Foad wrote:
[snip]
> Most of the 'Corrupt node-revision' errors were due to the byte-offset
> part of the node-rev id being wrong.  This error occurred with many
> different node-rev ids.  A corrupt revision contained from one to
> several ids with wrong byte-offsets.  Each particular node-rev id
> appeared in several different revisions after the one in which it was
> created, and it appeared correctly in some of them and wrongly in
> others, with no discernable pattern.  Every time it appeared wrongly, it
> had the same wrong value, so there were only two variants of each
> node-rev id: the right one and the wrong one.  The byte-offset was
> always fairly close to the correct value, but off by about 5 to 500
> bytes.  The wrong byte-offset did not point to any special place in the
> target revision file, such as the start or end of a data blob, so
> svnadmin reported 'Found malformed header'.

I ran across the noderev offset being wrong for the first time just
recently (well, at least in this way).  I can check, but I believe the
offset listed on the noderev I fixed recently was considerably off.
I'm still helping out that individual with something else, so I'll
check on that.

> One or two 'Corrupt node-revision' errors were wrong in another way.  A
> directory entry reference to a subdirectory named 'X' (not its real
> name) had the exact value 'dir 6-12953.0.r12953/30623'.  Exactly one of
> the node-revs created in r12953 was named 'X', and it was a directory at
> the right path, and its node-rev id was '0-12953.0.r12953/30403'.
> Therefore I concluded that that is the correct replacement.  Note that
> both the node-id component and the byte-offset part were wrong.

Wow!  I haven't seen one with both the node-id component and offset
wrong.  Nasty.

> The 'Corrupt representation' errors were also due to a byte-offset being
> wrong.  The second number, '1496' in the above example, is supposed to
> be the byte-offset in the revision file.  Like the node-rev byte
> offsets, these were typically off by a small amount.

I've seen this one occasionally too, but there is almost always some
other error in the file.  So I'm not sure if it's the result of a
single problem or more than one. :-(

> I did not investigate or fix the 'Reading one svndiff window ...' error.

It's been a while since I fixed one of these, but IIRC, it was because
the noderep said there were X bytes of data in the delta stream, and Y
bytes were actually there (where Y > X).

[snip]
> The script is currently split into several short files and would be
> better as a single script.  Or it could perhaps be incorporated into
> 'fsfsverify.py' or something else.

I'm certainly not opposed to this.  And, I'm willing to change the
license of fsfsverify to Apache, if that helps things.

FWIW, I do have some changes sitting in my working copy to help cope
with the new rev format.  However, I think it might be better to think
about doing something very different, and teach fsfsverify how to use
the whole repository.  I originally wrote this to help me analyze a
particular rev, but I've found you really need something better when
it comes to making sure you've got the right fix in place.  And it
would really help with things like truncating node-revs (and
truncating any references, etc.) and verifying the delta streams (since
we could actually regenerate the data).

-John


Fixing FSFS 'Corrupt node-revision' and 'Corrupt representation' errors

2010-10-06 Thread Julian Foad
We found some corruption in a FSFS repository we were using at work.  I
have written a script (attached) to fix most but not all of it.


WHAT WERE THE SYMPTOMS?
-----------------------

The version of mod_dav_svn being used was 1.6.9.

A user got an error trying to commit one particular file, and also when
attempting to check out a fresh WC.  I don't have details of these.

Then 'svnadmin verify' was run on the repo, and revealed several corrupt
revisions, with the following three kinds of error:

  * svnadmin: Corrupt node-revision '5-12980.0.r12980/5571'
    svnadmin: Found malformed header in revision file

  * svnadmin: Corrupt representation '13001 1496 2082 16645 [...]'
    svnadmin: Malformed representation header

  * svnadmin: Reading one svndiff window read beyond the end of the
    representation

There were dozens of the first kind, a few of the second kind and one or
two of the third kind.

The corrupt revisions were spread over a period of a few weeks, with no
corrupt revisions before that or after that.  We know of nothing special
about that time period.


ANALYSIS
--------

I used both plain text searching and John Szakmeister's 'fsfsverify.py'
to help analyze the revision files.  Here are just the brief results of
what I found.

Most of the 'Corrupt node-revision' errors were due to the byte-offset
part of the node-rev id being wrong.  This error occurred with many
different node-rev ids.  A corrupt revision contained from one to
several ids with wrong byte-offsets.  Each particular node-rev id
appeared in several different revisions after the one in which it was
created, and it appeared correctly in some of them and wrongly in
others, with no discernable pattern.  Every time it appeared wrongly, it
had the same wrong value, so there were only two variants of each
node-rev id: the right one and the wrong one.  The byte-offset was
always fairly close to the correct value, but off by about 5 to 500
bytes.  The wrong byte-offset did not point to any special place in the
target revision file, such as the start or end of a data blob, so
svnadmin reported 'Found malformed header'.
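
To make the parts of a node-rev id concrete, here is a small sketch that
splits an id such as '5-12980.0.r12980/5571' into its components and
compares two ids while ignoring the byte-offset.  The names NodeRevId,
parse_node_rev_id and same_except_offset are made up for illustration
and are not taken from the attached script; the sketch assumes the
committed-id form 'node-id.copy-id.rREV/OFFSET' seen in the error
messages above.

```python
import re
from typing import NamedTuple


class NodeRevId(NamedTuple):
    """Hypothetical container for the parts of a committed node-rev id."""
    node_id: str
    copy_id: str
    rev: int
    offset: int


# Assumed layout: <node-id>.<copy-id>.r<rev>/<byte-offset>
_ID_RE = re.compile(r'^([^./]+)\.([^./]+)\.r(\d+)/(\d+)$')


def parse_node_rev_id(text):
    """Split a node-rev id string into its four components."""
    m = _ID_RE.match(text)
    if m is None:
        raise ValueError('not a committed node-rev id: %r' % (text,))
    return NodeRevId(m.group(1), m.group(2), int(m.group(3)), int(m.group(4)))


def same_except_offset(a, b):
    """True if two ids agree on everything but the byte-offset."""
    return a[:3] == b[:3]
```

With this, an id with a wrong offset still compares equal to the right
one under same_except_offset(), which is the property the lookup relies
on.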

One or two 'Corrupt node-revision' errors were wrong in another way.  A
directory entry reference to a subdirectory named 'X' (not its real
name) had the exact value 'dir 6-12953.0.r12953/30623'.  Exactly one of
the node-revs created in r12953 was named 'X', and it was a directory at
the right path, and its node-rev id was '0-12953.0.r12953/30403'.
Therefore I concluded that that is the correct replacement.  Note that
both the node-id component and the byte-offset part were wrong.

The 'Corrupt representation' errors were also due to a byte-offset being
wrong.  The second number, '1496' in the above example, is supposed to
be the byte-offset in the revision file.  Like the node-rev byte
offsets, these were typically off by a small amount.
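
As an illustration, the leading fields of the representation string in
the error message can be pulled apart like this.  Only the meaning of
the second field (the byte-offset) is stated above; naming the first,
third and fourth fields 'rev', 'size' and 'expanded_size' is an
assumption about the FSFS layout, and parse_rep_fields is a hypothetical
helper, not part of the attached script.

```python
def parse_rep_fields(text):
    """Return the first four numeric fields of a representation string.

    Field names other than 'offset' are assumptions about the FSFS
    layout; anything after the fourth field is ignored.
    """
    parts = text.split()
    if len(parts) < 4:
        raise ValueError('too few fields in representation: %r' % (text,))
    rev, offset, size, expanded_size = (int(p) for p in parts[:4])
    return {
        'rev': rev,                      # revision holding the data
        'offset': offset,                # byte-offset in that revision file
        'size': size,                    # assumed: stored length in bytes
        'expanded_size': expanded_size,  # assumed: length after expansion
    }
```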

I did not investigate or fix the 'Reading one svndiff window ...' error.


THE SCRIPT TO FIX THE ERRORS
----------------------------

Usage:
  ./fix-repo REPO-DIR START-REVNUM

Files (attached, separately and as .tgz):
  fix-repo                # shell script, iterates over rev numbers; calls ...
  fixer/fix-rev.py        # finds and fixes errors, using ...
  fixer/find_good_id.py   # looks up a node-rev id, ignoring offset
  fixer/__init__.py       # empty file, defines this as a Python module

When the script sees a 'Corrupt node-revision' error message, it looks
up the node-rev id ignoring its offset part.  If found, it substitutes
the correct full id wherever it occurs in the revision file.  It expects
this change to result in a checksum error being reported next, and so it
substitutes the calculated checksum as reported in the error message.
(In fact, it assumes that any checksum error being reported should be
simply corrected in this way.)
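
The id substitution step could look roughly like the following.  This is
a sketch under stated assumptions, not the code in fix-rev.py; the name
substitute_id is invented for illustration.  It refuses a replacement of
a different length, on the assumption that changing the length of the
revision file would shift other byte-offsets in it, and it uses a
look-ahead so that an id is not matched as the prefix of one with a
longer offset.

```python
import re


def substitute_id(data, wrong_id, good_id):
    """Replace every occurrence of wrong_id in the bytes of a rev file.

    Hypothetical helper: 'data' is the whole revision file as bytes,
    and both ids are plain ASCII strings.
    """
    wrong = wrong_id.encode('ascii')
    good = good_id.encode('ascii')
    if len(wrong) != len(good):
        # Assumption: a length change would invalidate later offsets.
        raise ValueError('ids differ in length; not replacing')
    # (?!\d) stops '.../5571' from matching inside '.../55710'.
    pattern = re.compile(re.escape(wrong) + rb'(?!\d)')
    return pattern.sub(lambda m: good, data)
```

After such a change the checksum stored for the containing directory
listing no longer matches, which is why a checksum error is expected
next.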

For the second type of 'Corrupt node-revision' error, I could not find a
simple rule to determine when a node-rev id was wrong in this way so I
hard-coded that one specific substitution into the script.

When the script sees a 'Corrupt representation' error, it searches for
all representations in the target revision and, if exactly one of them
has the expected length, it substitutes the offset of this one.
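
A rough sketch of that search follows.  It assumes that each
representation in a revision file starts with a 'PLAIN' or 'DELTA ...'
header line and that its data is followed directly by the line 'ENDREP';
that layout is an assumption here, and find_rep_offsets is a
hypothetical name, not a function from the attached script.  Binary data
that happens to contain such a line can produce false matches, which is
one reason to accept a result only when it is unique.

```python
import re

# Assumed header forms: "PLAIN\n", "DELTA\n", "DELTA <rev> <off> <len>\n"
_REP_HEADER = re.compile(rb'^(?:PLAIN|DELTA(?: \d+ \d+ \d+)?)\n',
                         re.MULTILINE)
_TRAILER = b'ENDREP\n'


def find_rep_offsets(data, size):
    """Return offsets of all rep headers whose data is 'size' bytes long."""
    hits = []
    for m in _REP_HEADER.finditer(data):
        body_end = m.end() + size
        # A candidate fits if the trailer sits right after the data.
        if data[body_end:body_end + len(_TRAILER)] == _TRAILER:
            hits.append(m.start())
    return hits


def unique_rep_offset(data, size):
    """Return the single matching offset, or None if zero or several fit."""
    hits = find_rep_offsets(data, size)
    return hits[0] if len(hits) == 1 else None
```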


LIMITATIONS & IMPROVEMENTS
--------------------------

The script's algorithm is crude and could do with improvement in several
respects if it is to be used more widely.

It doesn't respect checksums.  When fixing a node-rev id, it should
update only the corresponding checksum rather than assuming that any
reported checksum error is the sole result of this fix.  When fixing a
representation offset, it should ensure the rep that it finds is in fact
the right one, probably by checking the checksum.

Detecting and fixing the second type of 'Corrupt node-revision' error
could probably be automated.

It doesn't replace a wrong byte-offset if the correct byte-offset has a
different number of digits.  I didn't encounter a need for this.  This
would be ve