The issue is that the bad character was added in 2004, see
https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385
before there were aggressive checks for that sort of thing. Garbage in,
garbage out... Neither the dump producer scripts nor the db table
inserts are going to alter data they are given on the grounds that it's
bad utf8. At least this particular instance is easy to fix for any
zhwiki editor.
Ariel
On Sat, 05-01-2013, at 19:58 +0100, Mathieu Poumeyrol wrote:
All,
I've been struggling to track this down for a few hours. This file is a SQL
dump, and its header says it is UTF-8:
http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
but:
$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 |
hexdump -C
0000 64 69 61 3a 43 6f f6 72 64 69 6e 61 74 69 65 20 |dia:Co.rdinatie |
0010 65 78 74 65 72 6e 65 20 70 75 62 6c 69 63 69 74 |externe publicit|
0020 65 69 74 2f 69 6e 74 65 72 6e 61 74 69 6f 6e 61 |eit/internationa|
0030 61 6c |al|
0032
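
The stray byte in the first row of the dump above is f6. That value can never start a UTF-8 sequence, but it is "ö" in Latin-1, so the title was presumably meant to read "Coördinatie". A minimal Python sketch of that check, assuming the byte values are exactly those shown in the hexdump; the Latin-1 reading is a guess on my part, not something the dump states:

```python
# Bytes copied from the first hexdump row above (16 bytes).
raw = bytes.fromhex("64 69 61 3a 43 6f f6 72 64 69 6e 61 74 69 65 20")

# Strict UTF-8 decoding fails on the f6 byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # exc.start is the index of the offending byte

# Latin-1 maps every byte to a code point, so this never fails.
# Assumption: the original editor's text was Latin-1 (ISO-8859-1).
as_latin1 = raw.decode("latin-1")

if utf8_error is not None:
    print("invalid UTF-8 at index", utf8_error.start, "-", utf8_error.reason)
print("as Latin-1:", as_latin1)
```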
There might be other occurrences, but one is enough to make my import scripts
crash, so... you guys are warned.
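
To list every occurrence rather than only the first, something along these lines might do. It is only a sketch: `find_invalid_utf8` and `scan_file` are hypothetical helper names, and the file is read line by line in binary mode so that nothing is decoded implicitly.

```python
def find_invalid_utf8(data: bytes):
    """Return (byte_offset, offending_bytes) for each invalid UTF-8 run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is clean
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice we decoded.
            start, end = pos + exc.start, pos + exc.end
            problems.append((start, data[start:end]))
            pos = end  # resume right after the bad bytes
    return problems


def scan_file(path):
    """Yield (line_number, absolute_byte_offset, offending_bytes) for a file."""
    offset = 0
    with open(path, "rb") as handle:
        for lineno, line in enumerate(handle, start=1):
            for start, bad in find_invalid_utf8(line):
                yield lineno, offset + start, bad
            offset += len(line)


if __name__ == "__main__":
    import sys

    for lineno, byte_offset, bad in scan_file(sys.argv[1]):
        print(f"line {lineno}, byte offset {byte_offset}: {bad.hex()}")
```

Saved under some name and run with zh-langlinks.sql as its argument, it prints one line per bad sequence. Slicing `data[pos:]` copies the remainder of the line after each hit, which is wasteful on very long lines but keeps the sketch short.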
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l