Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Wednesday 03 June 2009, Frans Pop wrote:
> As you may remember we had a problem before the release of Lenny with
> the l10n-sync script running wild and creating an insanely large Danish
> PO file for sublevel 4.
> This was eventually corrected, but the commits increasing the size of
> that da.po master file to eventually 250MB (and the same again spread
> out over the da.po files for several individual packages) are still there.
>
> These commits waste space on alioth and will also continue to cause
> problems, for example when people create a git-svn checkout [1].

I have done extensive testing and checks and am convinced there are no
remaining issues with the cleanup method. Unless there are strong
objections I intend to perform the cleanup soon.

I'll of course announce the date in advance; the repository will be
unavailable for commits for some time, but I expect that will be less
than 4 hours.

The result of the final cleanup method will be:
- SVN database will shrink, but only a small part is a result of the
  cleanup itself; mostly it is because the dump/load gets rid of cruft
  from old SVN versions;
- the cleanup will remove only broken l10n-sync commits and one
  incomplete early cleanup commit; no changes by users are lost or
  changed;
- the cause of the l10n-sync failure (broken PO file headers) is not
  removed, only the consequences (file corruption and extreme growth);
- these consequences are removed completely: after the cleanup the
  affected da.po files are all "clean", except for the broken headers;
- tagged versions from uploads of affected packages remain identical to
  what was uploaded to the archive because (as part of the cleanup) the
  corruption at the time of the upload is made part of the tag.

The main advantages of the cleanup are:
- a cleaner and more useful revision history for the affected files and
  packages;
- reduced risk of issues during future uses of the repository, such as
  git-svn checkouts, revision analysis, repository backup, possible
  repository conversion.

Cheers,
FJP

signature.asc
Description: This is a digitally signed message part.
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Friday 05 June 2009, Bastian Blank wrote:
> On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.
>
> A direct dump and load gives the following:
> | wa...@alioth:~$ du -s /svn/d-i/db
> | 2448288 /svn/d-i/db
> | wa...@alioth:~$ du -s debian/d-i/test/db
> | 1724144 debian/d-i/test/db

Cleaned version (with tagged releases now identical to existing tags!):
$ du -s repo/db
1716912 repo/db
$ cat repo/db/current
58721

So not a major difference. Fairly logical as the errors are repeating and
thus compress well.

--
To UNSUBSCRIBE, email to debian-boot-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
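To put the three du figures quoted above side by side, here is a small illustrative calculation. It is only arithmetic on the numbers from the thread; it assumes du reported 1 KiB blocks, and the variable and function names are made up for this sketch.

```python
# Sizes in KiB as reported by `du -s` in the messages above
# (assumption: du used its usual 1 KiB block size).
original_kib = 2448288   # /svn/d-i/db, the live repository
reloaded_kib = 1724144   # plain dump/load, no cleanup
cleaned_kib = 1716912    # dump, awk cleanup, then load


def saved_kib(before, after):
    """Return how many KiB were saved going from `before` to `after`."""
    return before - after


def as_mib(kib):
    """Convert KiB to MiB."""
    return kib / 1024


dump_load_saving = saved_kib(original_kib, reloaded_kib)
cleanup_saving = saved_kib(reloaded_kib, cleaned_kib)

print(f"dump/load alone saves {as_mib(dump_load_saving):.0f} MiB")
print(f"cleanup saves a further {as_mib(cleanup_saving):.1f} MiB")
```

On these figures nearly all of the reduction comes from the dump/load cycle itself, which is consistent with the "not a major difference" remark.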
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> As a result of the cleanup the 'svnadmin dump' file shrinks by more than
> 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

A direct dump and load gives the following:
| wa...@alioth:~$ du -s /svn/d-i/db
| 2448288 /svn/d-i/db
| wa...@alioth:~$ du -s debian/d-i/test/db
| 1724144 debian/d-i/test/db
| wa...@alioth:~$ cat /svn/d-i/db/current
| 58721
| wa...@alioth:~$ cat debian/d-i/test/db/current
| 58721

Bastian

--
It is a human characteristic to love little animals, especially if
they're attractive in some way.
                -- McCoy, "The Trouble with Tribbles", stardate 4525.6
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Thursday 04 June 2009, Bastian Blank wrote:
> > A tag is a copy, but the files are not actually copied. So if I
> > change the file in trunk in a revision before the tag, the tagged
> > version of the file will automatically change as well.
>
> You can change the file along with the copy operation.

Right, I see what you mean now. I've extended my awk script to do that:
- when a da.po file for a package is removed, it's also saved to a tmp
  dir, so I always have the latest version for each package available
- when I encounter a revision that creates a tag for the package, I read
  the last saved file back in (modifying the path to match the tag dir)

The cleaned dump file it creates looks good; next is testing if it loads
and checking resulting revisions.

New version of script attached (for posterity and the curious :-)

> Another problem: does anyone use dumps with deltas? The hotbackup
> script bundled with subversion does this.

No idea, though the subversion-tools package description says it's for
bdb based repos. If someone does I hope they speak up.

Cheers,
FJP

Output of the awk script

Tag r55973 found for cdebconf
282 lines restored
Tag r56032 found for partman-efi
96 lines restored
Tag r56062 found for nobootloader
227 lines restored
Tag r56074 found for partman-target
264 lines restored
Tag r56090 found for flash-kernel
92 lines restored
Tag r56092 found for silo-installer
224 lines restored
Tag r56094 found for partman-ext2r0
267 lines restored
Tag r56119 found for partman-palo
80 lines restored
Tag r56157 found for sibyl-installer
101 lines restored
Tag r56160 found for arcboot-installer
168 lines restored
Tag r56399 found for quik-installer
454 lines restored
Tag r56402 found for prep-installer
133 lines restored
Tag r56404 found for yaboot-installer
355 lines restored
Tag r56406 found for partman-prep
81 lines restored
Tag r56408 found for partman-newworld
106 lines restored
Tag r56411 found for cdebconf
282 lines restored
Tag r56825 found for cdebconf
282 lines restored

BEGIN {
    start = 1
    clean = 0
    infile = 0
    save_trans = 0
    restore_trans = 0
}

# Set limits of cleaning operation
/^Revision-number: 55934/ { clean = 1 }
/^Revision-number: 57134/ { clean = 0 }

# New revision; close previous one
/^Revision-number:/ {
    rev = substr($0, 18)
    infile = 0
    if (save_trans == 1) {
        close(pfile)
        save_trans = 0
    }
    # Restore last version of corrupted file for tagged version
    if (restore_trans == 1) {
        cnt = 0
        # Skip first (blank) line
        getline line "/dev/stderr"
        close(pfile)
        restore_trans = 0
    }
}

# New file in current revision
/^Node-path:/ {
    infile = 0
    npath = $0
    if (save_trans == 1) {
        close(pfile)
        save_trans = 0
    }
}

# These are the files we want
/^Node-path: trunk.*\/(po\/sublevel4|cdebconf|nobootloader|flash-kernel|partman-(prep|newworld|target|ext2r0|efi|palo)|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)\/.*da\.po/ {
    # Save a copy of the last version we encounter
    if ($0 !~ /\/sublevel4\//) {
        s = match($0, "[^/]+/debian")
        package = substr($0, s, RLENGTH - 7)
        pfile = "tmp/" package ".sv"
        save_trans = 1
    }
    infile = 1
}

# We're tagging a cleaned package => restore the da.po file to the
# uploaded version (last saved "cleaned" instance from trunk)
/^Node-copyfrom-path: trunk.*\/(cdebconf|nobootloader|flash-kernel|partman-(prep|newworld|target|ext2r0|efi|palo)|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)$/ {
    if (clean == 1) {
        s = match($0, "[^/]+$")
        package = substr($0, s)
        pfile = "tmp/" package ".sv"
        print "Tag r" rev " found for " package >"/dev/stderr"
        restore_trans = 1
    }
}

# The prevline construction is needed because if we restore a translation
# that needs to be done before the extra newline that starts a new revision
/.*/ {
    if (clean == 0 || infile == 0) {
        if (start != 1) {
            print prevline
        }
    } else if (save_trans == 1) {
        print prevline >pfile
    }
    start = 0
    prevline = $0
}

END {
    print prevline
}
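The two match()/substr() calls in the script above derive the package name from a dump header line and from it the name of the temporary save file. As a rough illustration, the same string handling can be sketched in Python. The function names and the example header lines are hypothetical, and the directory layout under trunk/ is an assumption; only the regular expressions and the "tmp/<name>.sv" naming come from the script.

```python
import re

# Mirror the awk calls match($0, "[^/]+/debian") and match($0, "[^/]+$").
_PKG_BEFORE_DEBIAN = re.compile(r"[^/]+/debian")
_LAST_COMPONENT = re.compile(r"[^/]+$")


def package_from_node_path(line):
    """Return the path component directly before '/debian', or None."""
    m = _PKG_BEFORE_DEBIAN.search(line.rstrip("\n"))
    if m is None:
        return None
    # Drop the trailing "/debian" (7 characters), as RLENGTH - 7 does in awk.
    return m.group(0)[: -len("/debian")]


def package_from_copyfrom_path(line):
    """Return the last path component of a Node-copyfrom-path line."""
    m = _LAST_COMPONENT.search(line.rstrip("\n"))
    return m.group(0) if m else None


def save_file_for(package):
    """Name of the temporary file the awk script uses for a package."""
    return "tmp/" + package + ".sv"


# Hypothetical header lines; the layout under trunk/ is assumed.
node = "Node-path: trunk/packages/partman/partman-efi/debian/po/da.po"
copy = "Node-copyfrom-path: trunk/packages/partman/partman-efi"
print(package_from_node_path(node))
print(save_file_for(package_from_copyfrom_path(copy)))
```

Because both lookups end up with the same package name, the tag rule can find the file that the trunk rule saved earlier.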
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
Quoting Frans Pop (elen...@planet.nl):
> As you may remember we had a problem before the release of Lenny with the
> l10n-sync script running wild and creating an insanely large Danish PO
> file for sublevel 4.

I can't comment deeply on your proposal, but I'd like to thank you for
taking care to repair that damage as much as possible, given that it
occurred mostly because I was not attentive enough to commit logs.

I'm highly confident that you'll take all the care needed to avoid
damaging the SVN, so the only thing I can really do is wish you good luck
and courage for a task that obviously needs to be done with great care.

Again, thanks.
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Thu, Jun 04, 2009 at 11:28:06AM +0200, Frans Pop wrote:
> On Thursday 04 June 2009, Bastian Blank wrote:
> > On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > > The way my cleanup works is that I remove all changes to the affected
> > > files made between revisions 55934 and 57133 (both inclusive).
> > > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.
> >
> > Which sizes did you compare? The d-i repo still includes plenty of
>
> Current database versus reloaded cleaned database.
>
> > vdelta revisions from repository format <= 3. A dump/load cycle should
> > reduce the size anyway.
>
> Ah, that is possible. The other advantages remain though.

An easy estimate is the file size of the affected revisions in db/revs.

> > Working copies with references to these revisions get invalidated.
>
> Hmm, yes that could be. Did not consider that.
> But what risk is there that there _are_ (m)any working copies that
> reference those revisions? The last commit I change was 08-01-2009, so
> most users should have 'svn updated' by now.

Really low, and the workaround is to remove the "broken" directories.

> > > Because of the way tagging in subversion works, it is not possible to
> > > do the cleanup and still keep the tagged versions exactly as they
> > > were uploaded (see below for affected package versions).
> > Please explain. A tag is just a copy, which can also include
> > modifications.
> A tag is a copy, but the files are not actually copied. So if I change the
> file in trunk in a revision before the tag, the tagged version of the
> file will automatically change as well.

You can change the file along with the copy operation.

> > > If we are agreed, I will pick a day to do the actual cleanup. During
> > > part of that day the repository will be blocked for commits.
> > There is no need to block anything. You can only change intermediate
> > revisions, so the top is not affected.
> I don't see how I could manipulate intermediate revs without rebuilding
> the database from the bottom up. What exact procedure are you referring
> to?

I thought again and realized that the internal ids will not permit this.

Another problem: does anyone use dumps with deltas? The hotbackup script
bundled with subversion does this.

Bastian

--
Where there's no emotion, there's no motive for violence.
                -- Spock, "Dagger of the Mind", stardate 2715.1
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Thursday 04 June 2009, Frans Pop wrote:
> On Thursday 04 June 2009, Bastian Blank wrote:
> > Working copies with references to these revisions get invalidated.
>
> OK. I'll test that and if it is a problem we'll have to warn about it.
> I don't think it's a huge problem if such users would have to do a new
> checkout.

I've tested that now and it is indeed an issue.

I installed the cleaned repo on a local server and then did a checkout of
revision 56250 (in the middle of the cleanup) from the official repo. I
then relocated the checkout to the local cleaned up repo and ran an svn up.
Result was that I got checksum mismatch errors for the affected da.po
files.

But there's also a simple workaround. Just delete the parent directory of
the "damaged" files, and svn will refetch that whole directory and continue
happily. So users only have to delete selected packages//debian/po dirs to
repair the damage.
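As a rough aid for the workaround described above, the directories to delete can be derived from the list of affected files in the original RFC. The sketch below only computes and prints them; it removes nothing, and the function name and the three-entry sample list are chosen here for illustration.

```python
from pathlib import PurePosixPath

# A few of the affected files named in the original RFC, with the
# relative paths exactly as listed there.
AFFECTED = [
    "cdebconf/debian/po/da.po",
    "partman/partman-efi/debian/po/da.po",
    "arch/powerpc/yaboot-installer/debian/po/da.po",
]


def dirs_to_refetch(affected_files):
    """Return the sorted, de-duplicated parent directories of the files."""
    return sorted({str(PurePosixPath(f).parent) for f in affected_files})


for d in dirs_to_refetch(AFFECTED):
    # Only prints; deleting the directory and running `svn up` afterwards
    # is left to the user.
    print(d)
```

Deleting only the po directory keeps the rest of the working copy, including any local changes outside that directory, untouched.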
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Thursday 04 June 2009, peter green wrote:
> > 3) The relevant versions are now no longer available anywhere [2]:
> > they are no longer in the archive and we don't have a snapshot.d.n
> > for that period.
>
> I don't think this statement is correct. snapshot.debian.net seems to
> have all dates up to and including 2009/03/28, that date is after the
> release of lenny rc1 afaict.

Last time I checked, and that was quite some time ago, I thought it had
stopped updating completely. But it looks like you're correct and the
affected versions are available.

After I sent the mail I decided that it would be a good idea to export the
currently tagged versions and keep them separately on alioth somewhere, so
that would cover that.

I still don't think it's a major issue, but thanks for the correction.

/me wonders when we'll be getting the long-promised snapshot.d.o...
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
> 3) The relevant versions are now no longer available anywhere [2]:
> they are no longer in the archive and we don't have a snapshot.d.n
> for that period.

I don't think this statement is correct. snapshot.debian.net seems to
have all dates up to and including 2009/03/28; that date is after the
release of lenny rc1 afaict.

Note: the search function on snapshot.debian.net stopped updating long
before the actual archiving stopped. Also, 2009 doesn't appear in the
index of archives (but the early 2009 stuff is accessible by manually
typing URLs).
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
Thanks a lot for the reply, Bastian.

On Thursday 04 June 2009, Bastian Blank wrote:
> On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > The way my cleanup works is that I remove all changes to the affected
> > files made between revisions 55934 and 57133 (both inclusive).
> > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.
>
> Which sizes did you compare? The d-i repo still includes plenty of

Current database versus reloaded cleaned database.

> vdelta revisions from repository format <= 3. A dump/load cycle should
> reduce the size anyway.

Ah, that is possible. The other advantages remain though.

> Working copies with references to these revisions get invalidated.

Hmm, yes that could be. Did not consider that.
But what risk is there that there _are_ (m)any working copies that
reference those revisions? The last commit I change was 08-01-2009, so
most users should have 'svn updated' by now.

Hmm. I guess some translators who worked on their translations in that
period and haven't been active since could have such a checkout.

OK. I'll test that and if it is a problem we'll have to warn about it.
I don't think it's a huge problem if such users would have to do a new
checkout.

> > Because of the way tagging in subversion works, it is not possible to
> > do the cleanup and still keep the tagged versions exactly as they
> > were uploaded (see below for affected package versions).
>
> Please explain. A tag is just a copy, which can also include
> modifications.

A tag is a copy, but the files are not actually copied. So if I change the
file in trunk in a revision before the tag, the tagged version of the
file will automatically change as well.

> > Essentially: not.
>
> This is incorrect. The effects are outlined in the Subversion FAQ and
> reference materials[1].

There does not seem to be anything there other than what we've already
covered. We don't lose any revisions and all revisions + the state of HEAD
remain completely identical to the current database.

> > If we are agreed, I will pick a day to do the actual cleanup. During
> > part of that day the repository will be blocked for commits.
>
> There is no need to block anything. You can only change intermediate
> revisions, so the top is not affected.

I don't see how I could manipulate intermediate revs without rebuilding
the database from the bottom up. What exact procedure are you referring
to?

Blocking the repo for a few hours shouldn't be a major inconvenience
anyway. It's not like we have a high commit rate ATM.

> > BEGIN {
> >     clean = 0
> >     infile = 0
> > }
> > [...]
>
> I think you want svndumpfilter.

I read about that, but I don't think it does what we need here: it only
filters paths, not specific commits [1].

Anyway, my awk script is already there and I've tested that it does exactly
what I want it to do. My cleaned dump file loads without any problems and
I've done fairly extensive checks with svnlook that the database is as it
should be after the load.
Despite the warnings, the dumpfile format is relatively straightforward
(and I did not use --incremental for my dump on purpose).

Thanks again,
FJP

[1] Hmm. Guess it could maybe be used, but I'd need to create a dumpfile
    for exactly the range to be cleaned and it would need to be run
    separately for each file to be excluded.
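The point above about filtering on paths versus filtering on specific commits can be illustrated with a small sketch that selects dump nodes by revision window and by path at the same time. This is not the awk script from the thread and not how svndumpfilter works; it is a simplified Python illustration that looks only at the "Revision-number:" and "Node-path:" header lines. The function name, the predicate and the sample lines are made up, and a real dump also carries properties and file content, which could themselves contain lines that look like headers.

```python
import re

# Revision window from the thread: 55934 to 57133, both inclusive.
FIRST_REV, LAST_REV = 55934, 57133

REVISION = re.compile(r"^Revision-number: (\d+)$")
NODE_PATH = re.compile(r"^Node-path: (.+)$")


def affected_nodes(lines, is_affected):
    """Yield (revision, path) for affected nodes inside the window.

    `lines` is an iterable of dump-file lines without trailing newlines;
    `is_affected` is a caller-supplied predicate on the node path.
    """
    rev = None
    for line in lines:
        m = REVISION.match(line)
        if m:
            rev = int(m.group(1))
            continue
        m = NODE_PATH.match(line)
        if m and rev is not None and FIRST_REV <= rev <= LAST_REV:
            if is_affected(m.group(1)):
                yield rev, m.group(1)


# Tiny made-up sample with assumed paths under trunk/.
sample = [
    "Revision-number: 55933",
    "Node-path: trunk/packages/po/sublevel4/da.po",
    "Revision-number: 55934",
    "Node-path: trunk/packages/po/sublevel4/da.po",
    "Node-path: trunk/packages/po/sublevel4/fr.po",
    "Revision-number: 57134",
    "Node-path: trunk/packages/po/sublevel4/da.po",
]
print(list(affected_nodes(sample, lambda p: p.endswith("/da.po"))))
```

Only the da.po node in revision 55934 is selected: the same path in 55933 and 57134 falls outside the window, and fr.po is not an affected file.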
Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> The way my cleanup works is that I remove all changes to the affected
> files made between revisions 55934 and 57133 (both inclusive).
> As a result of the cleanup the 'svnadmin dump' file shrinks by more than
> 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

Which sizes did you compare? The d-i repo still includes plenty of
vdelta revisions from repository format <= 3. A dump/load cycle should
reduce the size anyway.

> As a result of the cleanup, some revisions (24 in total) become empty as
> no other files were changed in that commit, but subversion handles this
> without problems: a diff against the previous revision just shows empty.
> I'll modify the revision comment to explain this. I'll also modify the
> comments for revisions that caused the problem and the (now very small)
> cleanup commits to explain the issue.

Working copies with references to these revisions get invalidated.

> Because of the way tagging in subversion works, it is not possible to do
> the cleanup and still keep the tagged versions exactly as they were
> uploaded (see below for affected package versions).

Please explain. A tag is just a copy, which can also include
modifications.

> Essentially: not.

This is incorrect. The effects are outlined in the Subversion FAQ and
reference materials[1].

> If we are agreed, I will pick a day to do the actual cleanup. During part
> of that day the repository will be blocked for commits.

There is no need to block anything. You can only change intermediate
revisions, so the top is not affected.

> BEGIN {
>     clean = 0
>     infile = 0
> }
[...]

I think you want svndumpfilter.

Bastian

[1]: http://subversion.tigris.org/faq.html#removal

--
Is truth not truth for all?
                -- Natira, "For the World is Hollow and I have Touched
                   the Sky", stardate 5476.4.
[RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository
As you may remember we had a problem before the release of Lenny with the
l10n-sync script running wild and creating an insanely large Danish PO
file for sublevel 4.
This was eventually corrected, but the commits increasing the size of that
da.po master file to eventually 250MB (and the same again spread out over
the da.po files for several individual packages) are still there.

These commits waste space on alioth and will also continue to cause
problems, for example when people create a git-svn checkout [1].

Today I've looked at options to clean up the worst of the mess and I think
I've found something that will work, but it has one important consequence
that needs to be discussed.

At the bottom of the mail is a list of affected files and packages.

THE CLEANUP
===========
The way my cleanup works is that I remove all changes to the affected
files made between revisions 55934 and 57133 (both inclusive).
As a result of the cleanup the 'svnadmin dump' file shrinks by more than
2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

The cleanup starts _after_ the problems started, so the affected da.po
files between the start of the problem (revision 55901) and the end of the
cleanup are still not technically correct. However, they now remain only a
little bit broken for the whole period instead of increasingly majorly
broken.

As a result of the cleanup, some revisions (24 in total) become empty as
no other files were changed in that commit, but subversion handles this
without problems: a diff against the previous revision just shows empty.
I'll modify the revision comment to explain this. I'll also modify the
comments for revisions that caused the problem and the (now very small)
cleanup commits to explain the issue.

The cleanup procedure is described below.

THE PROBLEM
===========
The issue occurred right around the release of D-I Lenny RC1. The Lenny
branch was created in the middle of the period and all the affected
packages were uploaded: first because of changes or an l10n upload series
and later after the errors in the Danish translation were corrected in the
Lenny branch.

Because of the way tagging in subversion works, it is not possible to do
the cleanup and still keep the tagged versions exactly as they were
uploaded (see below for affected package versions).

However, IMO the "damage" is acceptable, for the following reasons:
1) My cleanup stops _before_ the correction of the Danish translations in
   the Lenny branch by Christian. This means that the tags for the versions
   uploaded as a result of that, and also all versions released with Lenny,
   are 100% identical to what was uploaded.
2) For affected releases before that, the only file that is "incorrect" is
   the da.po file; the tagged version is still 100% correct for all other
   files in the packages.
3) The relevant versions are now no longer available anywhere [2]: they are
   no longer in the archive and we don't have a snapshot.d.n for that
   period.

HOW DOES IT AFFECT USERS
========================
Essentially: not.

During the cleanup the repository will be locked for commits. Users would
be advised not to try to do an svn up: it should do no harm except possibly
for the short time I'll be moving the cleaned repo in place.

There is one minor effect for git-svn users who have the affected period in
their history: their local git repository will no longer match the SVN
repository. But in practice that can do absolutely no harm.

WHAT NOW?
=========
The main question is if people agree with me that this cleanup is a good
thing and that the problem described is not serious enough to block it.
So: comments welcome!

If we are agreed, I will pick a day to do the actual cleanup. During part
of that day the repository will be blocked for commits.

Cheers,
FJP

[1] Phil Hands' git-svn checkout got buggered as a result of this.
[2] Not completely true: D-I Lenny RC1 images are still on the mirrors,
    but they will also disappear [3].
[3] BTW, looks like there are a number of old D-I releases in unstable
    that could be cleaned up. FTP masters will appreciate it.

Affected files/packages
-----------------------
po/sublevel4/da.po
cdebconf/debian/po/da.po
nobootloader/debian/po/da.po
flash-kernel/debian/po/da.po
partman/partman-prep/debian/po/da.po
partman/partman-newworld/debian/po/da.po
partman/partman-target/debian/po/da.po
partman/partman-palo/debian/po/da.po
partman/partman-ext2r0/debian/po/da.po
partman/partman-efi/debian/po/da.po
arch/sparc/silo-installer/debian/po/da.po
arch/powerpc/prep-installer/debian/po/da.po
arch/powerpc/quik-installer/debian/po/da.po
arch/powerpc/yaboot-installer/debian/po/da.po
arch/mips/sibyl-installer/debian/po/da.po
arch/mips/arcboot-installer/debian/po/da.po
arch/m68k/vmelilo-installer/debian/po/da.po

Package versions that will have tags not 100% equal to upload
-------------------------------------------------------------
r55973 cdebconf 0.136
r55975 partm