Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-16 Thread Frans Pop
On Wednesday 03 June 2009, Frans Pop wrote:
> As you may remember we had a problem before the release of Lenny with
> the l10n-sync script running wild and creating an insanely large Danish
> PO file for sublevel 4.
> This was eventually corrected, but the commits increasing the size of
> that da.po master file to eventually 250MB (and the same again spread
> out over the da.po files for several individual packages) are still there.

> These commits waste space on alioth and will also continue to cause
> problems, for example when people create a git-svn checkout [1].

I have done extensive testing and checks and am convinced there are no 
remaining issues with the cleanup method.

Unless there are strong objections I intend to perform the cleanup soon. 
I'll of course announce the date in advance; the repository will be 
unavailable for some time for commits, but I expect that will be less 
than 4 hours.

The result of the final cleanup method will be:
- SVN database will shrink, but only a small part is a result of the
  cleanup itself; mostly it is because the dump/load gets rid of cruft
  from old SVN versions;
- the cleanup will remove only broken l10n-sync commits and one
  incomplete early cleanup commit; no changes by users are lost or
  changed;
- the cause of the l10n-sync failure (broken PO file headers) is not
  removed, only the consequences (file corruption and extreme growth);
- these consequences are removed completely: after the cleanup the
  affected da.po files are all clean, except for the broken headers;
- tagged versions from uploads of affected packages remain identical
  to what was uploaded to the archive because (as part of the cleanup)
  the corruption at the time of the upload is made part of the tag.

The main advantages of the cleanup are:
- a cleaner and more useful revision history for the affected files and
  packages;
- reduced risk of issues during future uses of the repository, such as
  git-svn checkouts, revision analysis, repository backup, possible
  repository conversion.

Cheers,
FJP


signature.asc
Description: This is a digitally signed message part.


Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Christian Perrier
Quoting Frans Pop (elen...@planet.nl):
> As you may remember we had a problem before the release of Lenny with the
> l10n-sync script running wild and creating an insanely large Danish PO
> file for sublevel 4.


I can't comment deeply on your proposal, but I'd like to thank you for
taking care to repair that damage as much as possible, especially as it
occurred mostly because I was not attentive enough to commit logs.

I'm highly confident that you'll take all the care needed to avoid
damaging the SVN, so the only thing I can really do is wish you
good luck and courage for a task that obviously needs to be done
with great care. Again, thanks.






Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Frans Pop
On Thursday 04 June 2009, Bastian Blank wrote:
> > A tag is a copy, but the files are not actually copied. So if I
> > change the file in trunk in a revision before the tag, the tagged
> > version of the file will automatically change as well.

> You can change the file along with the copy operation.

Right, I see what you mean now. I've extended my awk script to do that:
- when a da.po file for a package is removed, it's also saved to a tmp
  dir, so I always have the latest version for each package available
- when I encounter a revision that creates a tag for the package, I read
  the last saved file back in (modifying the path to match the tag dir)

The cleaned dump file it creates looks good; next is testing if it loads 
and checking resulting revisions.

New version of script attached (for posterity and the curious :-)

> Another problem: does anyone use dumps with deltas? The hotbackup
> script bundled with subversion does this.

No idea, though the subversion-tools package description says it's for bdb 
based repos. If someone does I hope they speak up.

Cheers,
FJP

Output of the awk script

Tag r55973 found for cdebconf
   282 lines restored
Tag r56032 found for partman-efi
   96 lines restored
Tag r56062 found for nobootloader
   227 lines restored
Tag r56074 found for partman-target
   264 lines restored
Tag r56090 found for flash-kernel
   92 lines restored
Tag r56092 found for silo-installer
   224 lines restored
Tag r56094 found for partman-ext2r0
   267 lines restored
Tag r56119 found for partman-palo
   80 lines restored
Tag r56157 found for sibyl-installer
   101 lines restored
Tag r56160 found for arcboot-installer
   168 lines restored
Tag r56399 found for quik-installer
   454 lines restored
Tag r56402 found for prep-installer
   133 lines restored
Tag r56404 found for yaboot-installer
   355 lines restored
Tag r56406 found for partman-prep
   81 lines restored
Tag r56408 found for partman-newworld
   106 lines restored
Tag r56411 found for cdebconf
   282 lines restored
Tag r56825 found for cdebconf
   282 lines restored

BEGIN {
    start = 1
    clean = 0
    infile = 0
    save_trans = 0
    restore_trans = 0
}

# Set limits of cleaning operation
/^Revision-number: 55934/ {
    clean = 1
}
/^Revision-number: 57134/ {
    clean = 0
}

# New revision; close previous one
/^Revision-number:/ {
    rev = substr($0, 18)
    infile = 0
    if (save_trans == 1) {
        close(pfile)
        save_trans = 0
    }

    # Restore last version of corrupted file for tagged version
    if (restore_trans == 1) {
        cnt = 0
        # Skip first (blank) line
        getline line < pfile
        # Compare with 0 so that a read error (-1) also ends the loop
        while ((getline line < pfile) > 0) {
            cnt = cnt + 1
            if (line !~ /^Node-path:/) {
                print line
            } else {
                print npath "/debian/po/da.po"
            }
        }
        print cnt " lines restored" > "/dev/stderr"
        close(pfile)
        restore_trans = 0
    }
}

# New file in current revision
/^Node-path:/ {
    infile = 0
    npath = $0
    if (save_trans == 1) {
        close(pfile)
        save_trans = 0
    }
}

# These are the files we want
/^Node-path: trunk.*\/(po\/sublevel4|cdebconf|nobootloader|flash-kernel|partman-(prep|newworld|target|ext2r0|efi|palo)|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)\/.*da\.po/ {
    # Save a copy of the last version we encounter
    if ($0 !~ /\/sublevel4\//) {
        s = match($0, "[^/]+/debian")
        package = substr($0, s, RLENGTH - 7)
        pfile = "tmp/" package ".sv"
        save_trans = 1
    }
    infile = 1
}

# We're tagging a cleaned package => restore the da.po file to the
# uploaded version (last saved cleaned instance from trunk)
/^Node-copyfrom-path: trunk.*\/(cdebconf|nobootloader|flash-kernel|partman-(prep|newworld|target|ext2r0|efi|palo)|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)$/ {
    if (clean == 1) {
        s = match($0, "[^/]+$")
        package = substr($0, s)
        pfile = "tmp/" package ".sv"
        print "Tag r" rev " found for " package > "/dev/stderr"
        restore_trans = 1
    }
}

# The prevline construction is needed because if we restore a translation
# that needs to be done before the extra newline that starts a new revision
/.*/ {
    if (clean == 0 || infile == 0) {
        if (start != 1) {
            print prevline
        }
    } else if (save_trans == 1) {
        print prevline > pfile
    }
    start = 0
    prevline = $0
}

END {
    print prevline
}
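
The selection rule the script applies can be restated compactly as a predicate. The following Python sketch is not part of the cleanup; it only mirrors the revision limits and the path pattern of the awk script. The name `is_dropped` and the sample paths in the usage lines are hypothetical.

```python
import re

FIRST_REV = 55934  # first revision that is cleaned
LAST_REV = 57133   # last revision that is cleaned (the script stops at 57134)

# Same alternatives as the Node-path pattern in the awk script.
AFFECTED = re.compile(
    r"^trunk.*/(po/sublevel4|cdebconf|nobootloader|flash-kernel"
    r"|partman-(prep|newworld|target|ext2r0|efi|palo)"
    r"|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)/.*da\.po"
)


def is_dropped(revision, node_path):
    """True if a change to node_path in this revision is removed from the dump."""
    in_range = FIRST_REV <= revision <= LAST_REV
    return in_range and AFFECTED.search(node_path) is not None


# Hypothetical paths, for illustration only.
print(is_dropped(56250, "trunk/packages/cdebconf/debian/po/da.po"))
print(is_dropped(57134, "trunk/packages/cdebconf/debian/po/da.po"))
```

Changes outside the revision range, and changes to any other file inside it, are passed through untouched.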


Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Bastian Blank
On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> As a result of the cleanup the 'svnadmin dump' file shrinks by more than
> 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

A direct dump and load gives the following:

| wa...@alioth:~$ du -s /svn/d-i/db 
| 2448288 /svn/d-i/db
| wa...@alioth:~$ du -s debian/d-i/test/db 
| 1724144 debian/d-i/test/db
| wa...@alioth:~$ cat /svn/d-i/db/current
| 58721
| wa...@alioth:~$ cat debian/d-i/test/db/current
| 58721

Bastian

-- 
It is a human characteristic to love little animals, especially if
they're attractive in some way.
-- McCoy, The Trouble with Tribbles, stardate 4525.6




Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Frans Pop
On Friday 05 June 2009, Bastian Blank wrote:
> On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

> A direct dump and load gives the following:
> | wa...@alioth:~$ du -s /svn/d-i/db
> | 2448288 /svn/d-i/db
> | wa...@alioth:~$ du -s debian/d-i/test/db
> | 1724144 debian/d-i/test/db

Cleaned version (with tagged releases now identical to existing tags!):
$ du -s repo/db
1716912 repo/db
$ cat repo/db/current
58721

So not a major difference. Fairly logical as the errors are repeating and 
thus compress well.
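
To put numbers on "not a major difference", the three `du -s` figures quoted in this thread can be compared directly. The sketch below assumes that `du` reported 1 KiB blocks (the GNU default); the helper name `saving` is made up for this example.

```python
original = 2448288  # du -s /svn/d-i/db
reloaded = 1724144  # du -s debian/d-i/test/db (plain dump and load)
cleaned = 1716912   # du -s repo/db (cleaned dump, loaded)


def saving(before, after):
    """Return (difference in du units, difference as a percentage of before)."""
    diff = before - after
    return diff, 100.0 * diff / before


dumpload_diff, dumpload_pct = saving(original, reloaded)
cleanup_diff, cleanup_pct = saving(reloaded, cleaned)
print(f"dump/load alone: {dumpload_diff} ({dumpload_pct:.1f}%)")
print(f"cleanup on top:  {cleanup_diff} ({cleanup_pct:.2f}%)")
```

Under that assumption the dump/load cycle alone accounts for roughly 707 MiB, and the cleanup itself for about 7 MiB more.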


-- 
To UNSUBSCRIBE, email to debian-boot-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Bastian Blank
On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> The way my cleanup works is that I remove all changes to the affected
> files made between revisions 55934 and 57133 (both inclusive).
> As a result of the cleanup the 'svnadmin dump' file shrinks by more than
> 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

Which sizes did you compare? The d-i repo still includes plenty of
vdelta revisions from repository format = 3. A dump/load cycle should
reduce the size anyway.

> As a result of the cleanup, some revisions (24 in total) become empty as
> no other files were changed in that commit, but subversion handles this
> without problems: a diff against the previous revision just shows empty.
> I'll modify the revision comment to explain this. I'll also modify the
> comments for revisions that caused the problem and the (now very small)
> cleanup commits to explain the issue.

Working copies with references to these revisions get invalidated.

> Because of the way tagging in subversion works, it is not possible to do
> the cleanup and still keep the tagged versions exactly as they were
> uploaded (see below for affected package versions).

Please explain. A tag is just a copy, which can also include
modifications.

> Essentially: not.

This is incorrect. The effects are outlined in the Subversion FAQ and
reference materials[1].

> If we are agreed, I will pick a day to do the actual cleanup. During part
> of that day the repository will be blocked for commits.

There is no need to block anything. You can only change intermediate
revisions, so the top is not affected.

> BEGIN {
>   clean = 0
>   infile = 0
> }
[...]

I think you want svndumpfilter.

Bastian

[1]: http://subversion.tigris.org/faq.html#removal
-- 
Is truth not truth for all?
-- Natira, For the World is Hollow and I have Touched
   the Sky, stardate 5476.4.




Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Frans Pop
Thanks a lot for the reply, Bastian.

On Thursday 04 June 2009, Bastian Blank wrote:
> On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > The way my cleanup works is that I remove all changes to the affected
> > files made between revisions 55934 and 57133 (both inclusive).
> > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

> Which sizes did you compare? The d-i repo still includes plenty of

Current database versus reloaded cleaned database.

> vdelta revisions from repository format = 3. A dump/load cycle should
> reduce the size anyway.

Ah, that is possible. The other advantages remain though.

> Working copies with references to these revisions get invalidated.

Hmm, yes that could be. Did not consider that.
But what risk is there that there _are_ (m)any working copies that 
reference those revisions? The last commit I change was 08-01-2009, so 
most users should have 'svn updated' by now.

Hmm. I guess some translators who worked on their translations in that 
period and haven't been active since could have such a checkout. 

OK. I'll test that and if it is a problem we'll have to warn about it.
I don't think it's a huge problem if such users would have to do a new 
checkout.

> > Because of the way tagging in subversion works, it is not possible to
> > do the cleanup and still keep the tagged versions exactly as they
> > were uploaded (see below for affected package versions).

> Please explain. A tag is just a copy, which can also include
> modifications.

A tag is a copy, but the files are not actually copied. So if I change the 
file in trunk in a revision before the tag, the tagged version of the 
file will automatically change as well.

> > Essentially: not.

> This is incorrect. The effects are outlined in the Subversion FAQ and
> reference materials[1].

There does not seem to be anything there other than what we've already covered.
We don't lose any revisions and all revisions + the state of HEAD remain 
completely identical to the current database.

> > If we are agreed, I will pick a day to do the actual cleanup. During
> > part of that day the repository will be blocked for commits.

> There is no need to block anything. You can only change intermediate
> revisions, so the top is not affected.

I don't see how I could manipulate intermediate revs without rebuilding 
the database from the bottom up. What exact procedure are you referring 
to?

Blocking the repo for a few hours shouldn't be a major inconvenience 
anyway. It's not like we have a high commit rate ATM.

> > BEGIN {
> > clean = 0
> > infile = 0
> > }

> [...]

> I think you want svndumpfilter.

I read about that, but I don't think it does what we need here: it only 
filters paths, not specific commits [1]. Anyway, my awk script is already 
there and I've tested that it does exactly what I want it to do.
My cleaned dump file loads without any problems and I've done fairly 
extensive checks with svnlook that the database is as it should be after 
the load.

Despite the warnings, the dumpfile format is relatively straightforward 
(and I did not use --incremental for my dump on purpose).
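
As an illustration of how simple the format is, here is a minimal Python sketch of a reader for a non-delta dump. It relies only on the parts of the format visible in this thread plus the `Content-length` header that dump records carry: each record is a block of `Key: value` header lines, a blank line, and then exactly `Content-length` bytes of properties and text. The function names are hypothetical, and the sketch is not the method used for the cleanup (that is the line-based awk script).

```python
import io


def read_records(stream):
    """Yield (headers, body) for each record of a dump opened in binary mode."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if line == b"\n":
            continue                    # blank line between records
        headers = {}
        while line not in (b"", b"\n"):
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
            line = stream.readline()
        # Properties and file text follow the blank line. Reading them by
        # length means file content is never mistaken for a header line.
        body = stream.read(int(headers.get("Content-length", "0")))
        yield headers, body


def revisions_in(data):
    """Return the revision numbers found in an in-memory dump (bytes)."""
    return [int(h["Revision-number"])
            for h, _ in read_records(io.BytesIO(data))
            if "Revision-number" in h]
```

Skipping content by length is the main difference from a purely line-based filter, which has to assume that no file in the repository contains a line that looks like a dump header.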

Thanks again,
FJP

[1] Hmm. Guess it could maybe be used, but I'd need to create a dumpfile 
for exactly the range to be cleaned and it would need to be run 
separately for each file to be excluded.





Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread peter green



> 3) The relevant versions are now no longer available anywhere [2]: they
>    are no longer in the archive and we don't have a snapshot.d.n for that
>    period.
I don't think this statement is correct. snapshot.debian.net seems to 
have all dates up to and including 2009/03/28, that date is after the 
release of lenny rc1 afaict.


Note: the search function on snapshot.debian.net stopped updating long
before the actual archiving stopped, and 2009 also doesn't appear in the
index of archives (but the early 2009 material is accessible by typing
URLs manually).






Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Frans Pop
On Thursday 04 June 2009, peter green wrote:
> > 3) The relevant versions are now no longer available anywhere [2]:
> > they are no longer in the archive and we don't have a snapshot.d.n
> > for that period.

> I don't think this statement is correct. snapshot.debian.net seems to
> have all dates up to and including 2009/03/28, that date is after the
> release of lenny rc1 afaict.

Last time I checked, and that was quite some time ago, I thought it had 
stopped updating completely. But it looks like you're correct and the 
affected versions are available.

After I sent the mail I decided that it would be a good idea to export the 
currently tagged versions and keep them separately on alioth somewhere, 
so that would cover that.

I still don't think it's a major issue, but thanks for the correction.

/me wonders when we'll be getting the long-promised snapshot.d.o...





Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Frans Pop
On Thursday 04 June 2009, Frans Pop wrote:
> On Thursday 04 June 2009, Bastian Blank wrote:
> > Working copies with references to these revisions get invalidated.

> OK. I'll test that and if it is a problem we'll have to warn about it.
> I don't think it's a huge problem if such users would have to do a new
> checkout.

I've tested that now and it is indeed an issue.

I installed the cleaned repo on a local server and then did a checkout of 
revision 56250 (in middle of cleanup) from the official repo. I then 
relocated the checkout to the local cleaned up repo and ran an svn up.

Result was that I got checksum mismatch errors for the affected da.po 
files.

But there's also a simple workaround. Just delete the parent directory of 
the damaged files, and svn will refetch that whole directory and 
continue happily.
So users only have to delete selected packages/package/debian/po dirs to 
repair the damage.
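
A sketch of that workaround in Python, for anyone who wants to script it. Everything here is an assumption for illustration: the function names, the idea that the working copy root contains `packages/<package>/debian/po`, and the final `svn update` of the whole checkout. Local uncommitted changes in the deleted directories would be lost.

```python
import shutil
import subprocess
from pathlib import Path


def po_dirs(wc_root, packages):
    """Return the debian/po directories of the given packages in a checkout."""
    return [Path(wc_root) / "packages" / pkg / "debian" / "po" for pkg in packages]


def repair(wc_root, packages, run_svn=True):
    """Delete the affected po directories, then let svn fetch them again."""
    for po in po_dirs(wc_root, packages):
        if po.is_dir():
            shutil.rmtree(po)
    if run_svn:
        # Assumes the svn command line client is installed and on PATH.
        subprocess.run(["svn", "update", str(wc_root)], check=True)
```

A call such as `repair("d-i", ["cdebconf", "partman-efi"])` would cover two of the affected packages; the list of packages has to be taken from the checksum errors that `svn up` reports.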





Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Bastian Blank
On Thu, Jun 04, 2009 at 11:28:06AM +0200, Frans Pop wrote:
> On Thursday 04 June 2009, Bastian Blank wrote:
> > On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > > The way my cleanup works is that I remove all changes to the affected
> > > files made between revisions 55934 and 57133 (both inclusive).
> > > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

> > Which sizes did you compare? The d-i repo still includes plenty of
> Current database versus reloaded cleaned database.
> > vdelta revisions from repository format = 3. A dump/load cycle should
> > reduce the size anyway.
> Ah, that is possible. The other advantages remain though.

An easy estimate is the file size of the affected revisions in db/revs.
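
One way to compute that estimate, sketched in Python. It assumes an FSFS repository in which every revision is stored as a file named after its revision number below `db/revs`, either directly or inside shard subdirectories; packed shards are not handled, and the function name `revs_size` is made up for this example.

```python
import os


def revs_size(db_path, first_rev, last_rev):
    """Sum the sizes in bytes of the revision files first_rev..last_rev."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(os.path.join(db_path, "revs")):
        for name in filenames:
            if name.isdigit() and first_rev <= int(name) <= last_rev:
                total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

For the range discussed here the call would be `revs_size("/svn/d-i/db", 55934, 57133)`. The result is an upper bound, since those revision files also hold the other changes committed in the same range.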

> > Working copies with references to these revisions get invalidated.
> Hmm, yes that could be. Did not consider that.
> But what risk is there that there _are_ (m)any working copies that
> reference those revisions? The last commit I change was 08-01-2009, so
> most users should have 'svn updated' by now.

Really low, and the workaround is to remove the broken directories.

> > > Because of the way tagging in subversion works, it is not possible to
> > > do the cleanup and still keep the tagged versions exactly as they
> > > were uploaded (see below for affected package versions).
> > Please explain. A tag is just a copy, which can also include
> > modifications.
> A tag is a copy, but the files are not actually copied. So if I change the
> file in trunk in a revision before the tag, the tagged version of the
> file will automatically change as well.

You can change the file along with the copy operation.

> > > If we are agreed, I will pick a day to do the actual cleanup. During
> > > part of that day the repository will be blocked for commits.
> > There is no need to block anything. You can only change intermediate
> > revisions, so the top is not affected.
> I don't see how I could manipulate intermediate revs without rebuilding
> the database from the bottom up. What exact procedure are you referring
> to?

I thought again and realized that the internal ids will not permit this.

Another problem: does anyone use dumps with deltas? The hotbackup script
bundled with subversion does this.

Bastian

-- 
Where there's no emotion, there's no motive for violence.
-- Spock, Dagger of the Mind, stardate 2715.1

