Re: svnsync checksum error

2010-11-11 Thread Stefan Sperling
On Wed, Nov 10, 2010 at 08:55:49PM -0500, Edward Ned Harvey wrote:
  From: Stefan Sperling [mailto:s...@elego.de]
  
  It's 100% consistent.  I get the same checksum error, on the same file,
  every time.  I have a supposed good copy of the slave repo, at rev 4050...
  which will fail every time at 4061 (or something like that)...  The only
  explanation I can find is a md5sum collision going undetected, and then
  some larger operation has an md5sum which fails as a result.  I know it's
  astronomically impossible, but I can't come up with any other explanation.
  
  So you can reproduce it reliably? That's very interesting.
  I'd like to try to debug this. If it's possible to arrange access to your
  repository data please contact me off-list. Thanks.
 
 I believe we found the cause for mine.  It was hardware error, which was
 introduced silently into rev 4390 of my repo.  But I can't speak for the
 other folks here...  If they're having bugs, they might have bugs.
 
 One quick question though:  If the system is calculating checksums,
 shouldn't it store the checksums for future reference?  I find it very
 surprising that I can run svnadmin verify and no errors are detected, yet
 svnsync dies with a md5sum mismatch.  Maybe the md5sums are only used
 transiently and only by svnsync?

By design, the handling of checksums is sane.
Checksums are stored in the repository, and are calculated by the
repository layer. A client can only tell the repository what it expects
the checksum to be. When the client sends content, the repository
calculates the content's checksum and compares that to the expected
checksum. See also http://svn.haxx.se/dev/archive-2010-07/0426.shtml,
where Mike Pilato explains this in detail. [That thread is about a commit
I made that added property content checksums to dump files. The commit was
later reverted because it's just as cheap to compare the actual property
contents since we treat property content as strings (but the loader
doesn't do that, yet).]
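
To illustrate the check described above, here is a minimal Python sketch.
It is not Subversion's actual code. The names store_content and
ChecksumMismatch are hypothetical, and MD5 is assumed only because this
thread refers to the checksums as md5sums.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when received content does not match the expected checksum."""


def store_content(content: bytes, expected_md5: str) -> str:
    # The receiving side computes the checksum itself; the sender only
    # states what it expects the checksum to be.
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_md5.lower():
        raise ChecksumMismatch(
            f"expected:  {expected_md5}\n  actual:  {actual}"
        )
    # On success the computed checksum is what would be kept alongside
    # the content for later reference.
    return actual
```

The point of the design is that the stored checksum always comes from the
side that holds the data, so a sender cannot record a wrong checksum.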

I'm not sure what svnadmin verify is doing wrong in your case.
But I know that there are corruptions it doesn't detect, and we're
planning to improve this situation:
http://subversion.tigris.org/issues/show_bug.cgi?id=3706


Re: svnsync checksum error

2010-11-11 Thread Daniel Shahaf
Stefan Sperling wrote on Thu, Nov 11, 2010 at 13:29:14 +0100:
 I'm not sure what svnadmin verify is doing wrong in your case.
 But I know that there are corruptions it doesn't detect, and we're
 planning to improve this situation:
 http://subversion.tigris.org/issues/show_bug.cgi?id=3706

What's the recommendation in the meantime?  To use 'dump'?  To use
'dump|load'?  To use svnsync (...)?  To manually walk the entire
history and recompute all checksums?


Re: svnsync checksum error

2010-11-11 Thread Stefan Sperling
On Thu, Nov 11, 2010 at 03:10:19PM +0200, Daniel Shahaf wrote:
 Stefan Sperling wrote on Thu, Nov 11, 2010 at 13:29:14 +0100:
  I'm not sure what svnadmin verify is doing wrong in your case.
  But I know that there are corruptions it doesn't detect, and we're
  planning to improve this situation:
  http://subversion.tigris.org/issues/show_bug.cgi?id=3706
 
 What's the recommendation in the meantime?  To use 'dump'?  To use
 'dump|load'?  To use svnsync (...)?  To manually walk the entire
 history and recompute all checksums?

I'd recommend using fsfs-verify.py.


RE: svnsync checksum error

2010-11-11 Thread Edward Ned Harvey
 From: Stefan Sperling [mailto:s...@elego.de]
 
 By design, the handling of checksums is sane.
 Checksums are stored in the repository, and are calculated by the
 repository layer. A client can only tell the repository what it expects
 the checksum to be. When the client sends content, the repository
 calculates the content's checksum and compares that to the expected
 checksum. See also http://svn.haxx.se/dev/archive-2010-07/0426.shtml,
 where Mike Pilato explains this in detail. [That thread is about a commit
 I made that added property content checksums to dump files. The commit was
 later reverted because it's just as cheap to compare the actual property
 contents since we treat property content as strings (but the loader
 doesn't do that, yet).]
 
 I'm not sure what svnadmin verify is doing wrong in your case.
 But I know that there are corruptions it doesn't detect, and we're
 planning to improve this situation:
 http://subversion.tigris.org/issues/show_bug.cgi?id=3706

Actually, there is another option.  Perhaps svnadmin verify is doing
exactly the right thing ... checksums are stored in the repo, it calculates
checksums and verifies them all.  Perhaps it's right.  Perhaps we're only
*assuming* the corruption is at the slave side, while the corruption is
actually at the master.  The only thing I know for sure is that there's a
checksum mismatch between master & slave, for a specific file, beginning at
a specific rev...  Maybe the master is the one who's wrong.

Problem is, I don't know of any way to check, and determine which side is
wrong.  It's very labor intensive to checkout all the revs from the slave,
and from the master, and diff them all, to see if any other files are
corrupt.  But that is my plan, if I can't come up with a better idea.
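
A rough sketch of how that comparison might be automated for a single file,
in Python. Everything here is an assumption for illustration: the function
name first_divergent_revision is made up, and the two fetch callables are
placeholders for whatever returns the file's bytes at a given revision from
each side (for example, a wrapper around `svn cat -r REV URL`, not shown).

```python
import hashlib
from typing import Callable, Iterable, Optional

# A fetcher takes a revision number and returns the file's bytes there.
Fetch = Callable[[int], bytes]


def first_divergent_revision(
    revisions: Iterable[int],
    fetch_master: Fetch,
    fetch_mirror: Fetch,
) -> Optional[int]:
    """Return the lowest revision where the two sides differ, or None."""
    for rev in sorted(revisions):
        master_sum = hashlib.md5(fetch_master(rev)).digest()
        mirror_sum = hashlib.md5(fetch_mirror(rev)).digest()
        if master_sum != mirror_sum:
            return rev
    return None
```

Restricting the revision list to those in which the file actually changed
keeps the number of fetches small. Note that this only shows where the two
sides start to disagree; it does not say which side is the wrong one.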



Re: svnsync checksum error

2010-11-10 Thread Daniel Shahaf
OSG wrote on Tue, Nov 09, 2010 at 20:58:53 -0600:
 On 11/09/2010 06:41 PM, Daniel Shahaf wrote:
  Edward Ned Harvey wrote on Sat, Nov 06, 2010 at 20:29:18 -0400:
  From: opensrcguru [mailto:opensrcg...@gmail.com]
 
  Today, the sync process started failing on 1 repo (all others were
  unaffected) on both r/o copies at the exact same time/same revision
  with errors similar to the following...
 
  Transmitting file data .svnsync: Base checksum mismatch on
  '/path/to/file/foo/bar':
 expected:  2f2e025c4c4855e7466799a877b3e23d
   actual:  272214b9518d352e16e7eeceeb22f573
 
  
  Can you compare the contents of /path/to/file/foo/bar between the master
  and mirror, as of the last revision successfully synced to the mirror?
 Yes, I had done that and yes, the last sync'd revs were intact and accurate.
 

So they are textually identical?  Can you compare their checksums to the
two checksums in the error message?
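
One way to do that comparison by hand, as a rough Python sketch. The
function names are made up for illustration. MD5 is an assumption, based on
the 32-hex-digit values in the error message and on the thread calling them
md5sums. The path would be a local copy of the file taken from the master
or from the mirror.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def which_side(path, expected, actual):
    """Say which checksum from the error message the file matches."""
    value = md5_of_file(path)
    if value == expected.lower():
        return "expected"
    if value == actual.lower():
        return "actual"
    return "neither"
```

Running which_side on both copies, with the "expected" and "actual" values
from the error message, shows whether each copy matches one of them.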

  If you create a fresh mirror and svnsync it, from r0 to that revision,
  does the file /path/to/file/foo/bar in the fresh mirror differ from the
  one in the master?
 No, a resync from r0 to current does not result in any differences.
 

Meaning, a fresh resync is successful and doesn't cause any error messages?

Or meaning, it results in the same error messages as before?


Re: svnsync checksum error

2010-11-10 Thread 'Daniel Shahaf'
Edward Ned Harvey wrote on Wed, Nov 10, 2010 at 00:28:48 -0500:
  From: Daniel Shahaf [mailto:d...@daniel.shahaf.name]
  
  Can you compare the contents of /path/to/file/foo/bar between the master
  and mirror, as of the last revision successfully synced to the mirror?
 
 The latest rev which synced without reporting any error was 5045.  It was
 trying to go from 5045 to 5046 when it triggered the checksum failure.
 
 I checked the history of the file in question, and it was changed in ~200
 different revs.  But the revs of interest are:  in 4390, it synced to the
 slave without reporting any error, however, from 4390 onward, if I checkout
 from the slave and master, the two files differ.  And the next rev where
 this file was changed was 5046, which is when svnsync notices the checksum
 mismatch, and dies.
 

Okay.

 It would seem, all of this behavior could be explained by a simple
 undetected hardware error.  During sync of 4390, the slave wrote some bits
 to disk, which got written wrongly.  It is known that disks will do this
 rarely.  This is one of the huge arguments in favor of ZFS and BTRFS and
 filesystem checksumming in general.  Such filesystems detect and correct
 data corruption which would have otherwise passed silently...  Which seems
 to be what happened in my case.
 

Yes, the question is whether this thread is just a bunch of hardware
errors, or something deeper.

 All servers and clients are running 1.6.12.  However, at the time when 4390
 was committed...  The master was 1.6.12, but the slave was probably 1.5.7
 
 
  If you create a fresh mirror and svnsync it, from r0 to that revision,
  does the file /path/to/file/foo/bar in the fresh mirror differ from the
  one in the master?
 
 No problems.  Although ... I didn't let it sync from rev 0.  (That would be
 impossibly time consuming...  weeks)  I did as mentioned before.
 Transferred a backup of the master to the slave, and used it as the seed
 for the sync, so I only needed to sync the last 100 revs or something like
 that...
 

That would mean that the last changed revision --- r4390 --- is
contained in the seed and wasn't re-svnsync'd.  If we suspect that
svnsync committed a bogus r4390 to the slave, we'd better start with
a slave that /doesn't/ already have a knowingly-good r4390...

Of course, you can take that backup and use it to produce a repository
whose youngest revision is earlier than r4390.




Re: svnsync checksum error

2010-11-10 Thread opensrcguru
On Wed, Nov 10, 2010 at 10:49 AM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 OSG wrote on Tue, Nov 09, 2010 at 20:58:53 -0600:
 On 11/09/2010 06:41 PM, Daniel Shahaf wrote:
  Edward Ned Harvey wrote on Sat, Nov 06, 2010 at 20:29:18 -0400:
  From: opensrcguru [mailto:opensrcg...@gmail.com]
 
  Today, the sync process started failing on 1 repo (all others were
  unaffected) on both r/o copies at the exact same time/same revision
  with errors similar to the following...
 
  Transmitting file data .svnsync: Base checksum mismatch on
  '/path/to/file/foo/bar':
     expected:  2f2e025c4c4855e7466799a877b3e23d
       actual:  272214b9518d352e16e7eeceeb22f573
 
 
  Can you compare the contents of /path/to/file/foo/bar between the master
  and mirror, as of the last revision successfully synced to the mirror?
 Yes, I had done that and yes, the last sync'd revs were intact and accurate.


 So they are textually identical?
Yes.

 Can you compare their checksums to the two checksums in the error message?
I hadn't yet, but I can. What is being used to perform the sum (md5/sha1/???)?

  If you create a fresh mirror and svnsync it, from r0 to that revision,
  does the file /path/to/file/foo/bar in the fresh mirror differ from the
  one in the master?
 No, a resync from r0 to current does not result in any differences.


 Meaning, a fresh resync is successful and doesn't cause any error messages?

 Or meaning, it results in the same error messages as before?


Correct. A new/fresh resync from r0 (including the previously troubled
revision) to latest completes successfully with no errors. That
process was the last in my troubleshooting process and is how I worked
around the problem.

--

In my case, I do not believe it to be hardware related because I had
two r/o copies that exhibited the same behavior at the same rev at the
same time. That is, unless there was a hardware issue on the source
copy. Although possible, pretty unlikely.


Re: svnsync checksum error

2010-11-10 Thread Les Mikesell

On 11/10/2010 1:39 PM, opensrcguru wrote:


Correct. A new/fresh resync from r0 (including the previously troubled
revision) to latest completes successfully with no errors. That
process was the last in my troubleshooting process and is how I worked
around the problem.

--

In my case, I do not believe it to be hardware related because I had
two r/o copies that exhibited the same behavior at the same rev at the
same time. That is, unless there was a hardware issue on the source
copy. Although possible, pretty unlikely.


I was able to fix mine by dumping up to a revision before the last few 
changes to the file with the error, loading that back and tweaking the 
properties that tell svnsync where to continue.  I agree that a hardware 
error is pretty unlikely here.  In my case it was a large zip file where 
the problem happened. Is there any chance there could have been a 
problem in the binary diff computation in a 1.6.x release version?  I'm 
not exactly sure what version would have been running when the error 
happened but I copied things over to a machine with 1.6.13 for the 
repair and it did not duplicate the problem.


--
  Les Mikesell
lesmikes...@gmail.com



Re: svnsync checksum error

2010-11-10 Thread Stefan Sperling
On Sun, Nov 07, 2010 at 12:48:01PM -0500, Edward Ned Harvey wrote:
 I do think it's a bug, but I was never able to find enough info to make it
 into a bug report.  I kept all the good & bad versions of the repository...
 I ran the svnadmin verify all over the place (which is enormously time 
 consuming) ... svnadmin dump | svnadmin load ... Everything I can think of.  
 Never got any error in any way, except by repeating the svnsync from the 
 master.

I think it's a bug, too.
We (elego) have seen this svnsync checksum error at a customer site, too.
Never figured out how to reproduce it.

 It's 100% consistent.  I get the same checksum error, on the same file, every 
 time.  I have a supposed good copy of the slave repo, at rev 4050... which 
 will fail every time at 4061 (or something like that)...  The only 
 explanation I can find is a md5sum collision going undetected, and then some 
 larger operation has an md5sum which fails as a result.  I know it's 
 astronomically impossible, but I can't come up with any other explanation.

So you can reproduce it reliably? That's very interesting.
I'd like to try to debug this. If it's possible to arrange access to
your repository data please contact me off-list. Thanks.

Stefan


RE: svnsync checksum error

2010-11-10 Thread Edward Ned Harvey
 From: Stefan Sperling [mailto:s...@elego.de]
 
 It's 100% consistent.  I get the same checksum error, on the same file,
 every time.  I have a supposed good copy of the slave repo, at rev 4050...
 which will fail every time at 4061 (or something like that)...  The only
 explanation I can find is a md5sum collision going undetected, and then
 some larger operation has an md5sum which fails as a result.  I know it's
 astronomically impossible, but I can't come up with any other explanation.
 
 So you can reproduce it reliably? That's very interesting.
 I'd like to try to debug this. If it's possible to arrange access to your
 repository data please contact me off-list. Thanks.

I believe we found the cause for mine.  It was hardware error, which was
introduced silently into rev 4390 of my repo.  But I can't speak for the
other folks here...  If they're having bugs, they might have bugs.

One quick question though:  If the system is calculating checksums,
shouldn't it store the checksums for future reference?  I find it very
surprising that I can run svnadmin verify and no errors are detected, yet
svnsync dies with a md5sum mismatch.  Maybe the md5sums are only used
transiently and only by svnsync?



Re: svnsync checksum error

2010-11-09 Thread Daniel Shahaf
Edward Ned Harvey wrote on Sat, Nov 06, 2010 at 20:29:18 -0400:
  From: opensrcguru [mailto:opensrcg...@gmail.com]
  
  Today, the sync process started failing on 1 repo (all others were
  unaffected) on both r/o copies at the exact same time/same revision
  with errors similar to the following...
  
  Transmitting file data .svnsync: Base checksum mismatch on
  '/path/to/file/foo/bar':
 expected:  2f2e025c4c4855e7466799a877b3e23d
   actual:  272214b9518d352e16e7eeceeb22f573
 

Can you compare the contents of /path/to/file/foo/bar between the master
and mirror, as of the last revision successfully synced to the mirror?

If you create a fresh mirror and svnsync it, from r0 to that revision,
does the file /path/to/file/foo/bar in the fresh mirror differ from the
one in the master?

What versions of everything are you using?

What format are the repositories?  (What are the contents of the files
$REPOS_DIR/db/fs-type and $REPOS_DIR/db/format?)
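
If it helps, here is a small Python sketch for collecting those two files
from a repository directory. The function name describe_repository is made
up for illustration; only the two file paths come from the question above.

```python
from pathlib import Path


def describe_repository(repos_dir):
    """Return the stripped contents of db/fs-type and db/format."""
    db = Path(repos_dir) / "db"
    return {
        name: (db / name).read_text().strip()
        for name in ("fs-type", "format")
    }
```

Running it against the master and each mirror and posting the results
would answer the format question for all repositories at once.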

 I recently had the same problem.  I never found any cause for it, but
 I did manage to deal with it somewhat better than you did.  On the
 master, I did svnadmin hotcopy, then I tarred up the backup and sent
 it to the slave, and extracted it.  I had to configure the slave hook
 scripts, and the revprop rev 0 properties, and then I was able to
 svnsync to the slave again.  The main point of difference ... No need
 to wait for 65k commits to transfer.  Since it's starting from
 a recent backup, it's enormously faster.
 
 
 


RE: svnsync checksum error

2010-11-09 Thread Edward Ned Harvey
 From: Daniel Shahaf [mailto:d...@daniel.shahaf.name]
 
 Can you compare the contents of /path/to/file/foo/bar between the master
 and mirror, as of the last revision successfully synced to the mirror?

The latest rev which synced without reporting any error was 5045.  It was
trying to go from 5045 to 5046 when it triggered the checksum failure.

I checked the history of the file in question, and it was changed in ~200
different revs.  But the revs of interest are:  in 4390, it synced to the
slave without reporting any error, however, from 4390 onward, if I checkout
from the slave and master, the two files differ.  And the next rev where
this file was changed was 5046, which is when svnsync notices the checksum
mismatch, and dies.

It would seem, all of this behavior could be explained by a simple
undetected hardware error.  During sync of 4390, the slave wrote some bits
to disk, which got written wrongly.  It is known that disks will do this
rarely.  This is one of the huge arguments in favor of ZFS and BTRFS and
filesystem checksumming in general.  Such filesystems detect and correct
data corruption which would have otherwise passed silently...  Which seems
to be what happened in my case.

All servers and clients are running 1.6.12.  However, at the time when 4390
was committed...  The master was 1.6.12, but the slave was probably 1.5.7


 If you create a fresh mirror and svnsync it, from r0 to that revision,
 does the file /path/to/file/foo/bar in the fresh mirror differ from the
 one in the master?

No problems.  Although ... I didn't let it sync from rev 0.  (That would be
impossibly time consuming...  weeks)  I did as mentioned before.
Transferred a backup of the master to the slave, and used it as the seed
for the sync, so I only needed to sync the last 100 revs or something like
that...



RE: svnsync checksum error

2010-11-07 Thread Edward Ned Harvey
 From: opensrcguru [mailto:opensrcg...@gmail.com]
 
 Today, the sync process started failing on 1 repo (all others were
 unaffected) on both r/o copies at the exact same time/same revision
 with errors similar to the following...
 
 Transmitting file data .svnsync: Base checksum mismatch on
 '/path/to/file/foo/bar':
expected:  2f2e025c4c4855e7466799a877b3e23d
  actual:  272214b9518d352e16e7eeceeb22f573

I recently had the same problem.  I never found any cause for it, but I did 
manage to deal with it somewhat better than you did.  On the master, I did 
svnadmin hotcopy, then I tarred up the backup and sent it to the slave, and 
extracted it.  I had to configure the slave hook scripts, and the revprop rev 0 
properties, and then I was able to svnsync to the slave again.  The main point 
of difference ... No need to wait for 65k commits to transfer.  Since it's 
starting from a recent backup, it's enormously faster.





RE: svnsync checksum error

2010-11-07 Thread Edward Ned Harvey
 From: Terry Inzauro [mailto:opensrcg...@gmail.com]
 
 I've found a handful of other cases similar to ours. Do you think a bug report
 is warranted or is this unique to our configurations?

I do think it's a bug, but I was never able to find enough info to make it
into a bug report.  I kept all the good & bad versions of the repository...  I
ran the svnadmin verify all over the place (which is enormously time 
consuming) ... svnadmin dump | svnadmin load ... Everything I can think of.  
Never got any error in any way, except by repeating the svnsync from the master.

It's 100% consistent.  I get the same checksum error, on the same file, every 
time.  I have a supposed good copy of the slave repo, at rev 4050... which 
will fail every time at 4061 (or something like that)...  The only explanation 
I can find is a md5sum collision going undetected, and then some larger 
operation has an md5sum which fails as a result.  I know it's astronomically 
impossible, but I can't come up with any other explanation.



Re: svnsync checksum error

2010-11-06 Thread Terry Inzauro
On 11/06/2010 07:29 PM, Edward Ned Harvey wrote:
 From: opensrcguru [mailto:opensrcg...@gmail.com]

 Today, the sync process started failing on 1 repo (all others were
 unaffected) on both r/o copies at the exact same time/same revision
 with errors similar to the following...

 Transmitting file data .svnsync: Base checksum mismatch on
 '/path/to/file/foo/bar':
expected:  2f2e025c4c4855e7466799a877b3e23d
  actual:  272214b9518d352e16e7eeceeb22f573
 
 I recently had the same problem.  I never found any cause for it, but I did 
 manage to deal with it somewhat better than you did.  On the master, I did 
 svnadmin hotcopy, then I tarred up the backup and sent it to the slave, and 
 extracted it.  I had to configure the slave hook scripts, and the revprop rev 
 0 properties, and then I was able to svnsync to the slave again.  The main 
 point of difference ... No need to wait for 65k commits to transfer.  Since 
 it's starting from a recent backup, it's enormously faster.
 


Yes, that sounds quite a bit easier/quicker.  I didn't realise the r/o copies
maintained by svnsync were that similar to the r/w copies they get their data
from.  Thank you for the information.

I've found a handful of other cases similar to ours. Do you think a bug report
is warranted or is this unique to our configurations?


kind regards,

OSG



svnsync checksum error

2010-11-05 Thread opensrcguru
List,

I've got about 20 repos that have been successfully syncing (with
svnsync) to two read only copies for a few months. The r/w copy and
both r/o copies are located on a local LAN (different subnets
separated by firewalls).

Today, the sync process started failing on 1 repo (all others were
unaffected) on both r/o copies at the exact same time/same revision
with errors similar to the following...

Transmitting file data .svnsync: Base checksum mismatch on
'/path/to/file/foo/bar':
   expected:  2f2e025c4c4855e7466799a877b3e23d
 actual:  272214b9518d352e16e7eeceeb22f573

I successfully removed the uncommitted transactions (svnadmin rmtxns
reponame `svnadmin lstxns reponame`) and attempted the re-sync, to
no avail.

svnadmin verify returned no errors.

I ended up re-creating the r/o repo and then re-syncing all 65k
commits to the repos (which takes a while...)

Software binaries from Collabnet:
r/w version = svn/svnsync, version 1.6.13 (r1002816)
r/o 1 version = svn/svnsync, version 1.6.13 (r1002816)
r/o 2 version = svn/svnsync, version 1.6.13 (r1002816)

Is there a better approach to resolving the issue?
Am I running into a known issue?


Any help/insight would be greatly appreciated.


OSG