On 29/03/2010 09:00, Pierre Schmitz wrote:
On Monday, 29 March 2010 01:21:09, Allan McRae wrote:
Is there any progress on fixing this?  There are a lot of packaging
notes on those pages that would be a shame to lose.

I did it last Thursday. I've done my best to repair the MySQL backup Loui pointed me at. I'd say it's 95% fixed now, but the procedure left a few isolated illegal characters in its wake (like this: �), especially within Cyrillic and CJK text. The text should still be legible, however. You can compare the original on sigurd:
/srv/http/aur.archlinux.org/backup/aur-20100205-1859.sql.gz
with my repaired version:
/home/francois/aur-20100205-1859.sql.fixed2.xz
and judge whether any further effort is needed or justified.
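
One rough way to judge how much damage remains is to count the U+FFFD replacement characters left in the repaired dump. The following is only a sketch of how that could be done, not something that was run on sigurd: the function names are hypothetical, the path is passed in by the caller, and it assumes the repaired file is xz-compressed UTF-8 text.

```python
import lzma

# U+FFFD is the character that shows up as "�" in the repaired text.
REPLACEMENT = "\ufffd"


def count_replacement_chars(lines):
    """Return (total U+FFFD characters, number of lines containing any)."""
    total = 0
    affected_lines = 0
    for line in lines:
        hits = line.count(REPLACEMENT)
        if hits:
            total += hits
            affected_lines += 1
    return total, affected_lines


def scan_dump(path):
    """Scan an xz-compressed SQL dump (path supplied by the caller).

    errors="replace" turns bytes that are not valid UTF-8 into U+FFFD as
    well, so those are counted as damage too.
    """
    with lzma.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_replacement_chars(fh)
```

Calling `scan_dump()` on the repaired file would give a count of damaged characters and affected lines, which is a crude but quick measure of whether further effort is justified.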
It's very likely the same issue I had when updating the wiki. It is caused by a
MySQL packaging change which switched the default encoding from latin1 to
utf8. Here are some tips:
http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL

But I guess we lost the chance to fix this more or less easily because the AUR
content has changed since the last backup.
Indeed. Believe me, the encoding of the strings was a terrible mess (mostly the comments, but also the user names), so it was no longer simply a matter of converting from one charset to another. Basically, what I did was convert from Windows-1252 (!) to UTF-8, and then repair all "doubly-encoded" UTF-8 characters using the Perl module Encode::DoubleEncodedUTF8 (on CPAN). But as I said above, there is no way to automatically recover everything from that one backup alone.
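
For illustration, the two steps can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure actually used and not the algorithm of Encode::DoubleEncodedUTF8: the function names are hypothetical, and the repair simply tries to undo one spurious Windows-1252 layer on each run of non-ASCII characters, leaving the run alone if that fails.

```python
import re

# Only runs of non-ASCII characters are candidates for repair.
_NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def recode_cp1252_to_text(raw: bytes) -> str:
    """Step 1: interpret the dump's bytes as Windows-1252.

    Bytes that Python's cp1252 codec leaves undefined become U+FFFD.
    """
    return raw.decode("cp1252", errors="replace")


def _repair_run(match):
    run = match.group(0)
    try:
        # Recover the underlying bytes, then read them as UTF-8.
        return run.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not doubly encoded, or not recoverable: keep the run unchanged.
        return run


def fix_double_encoded(text: str) -> str:
    """Step 2: repair doubly-encoded runs and keep everything else as-is."""
    return _NON_ASCII_RUN.sub(_repair_run, text)
```

With input that mixes genuine Windows-1252 bytes and UTF-8 bytes, step 1 leaves the former correct and turns the latter into doubly-encoded text, which step 2 then repairs. Any run that cannot be mapped back cleanly stays damaged, which is consistent with the point that not everything can be recovered automatically.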

This requires some kind of script
that imports and merges the old and new comments.

The problem with that "import and merge" operation – unless it is done with a reliable and well-tested tool – is that it risks damaging the data even more than it already is ;) I'll leave it to Loui to decide whether it's worth the trouble.

F
