Re: [Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-08 Thread Ariel T. Glenn
The issue is that the bad character was added in 2004, see

https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385

before there were aggressive checks for that sort of thing.  Garbage in,
garbage out...  Neither the dump producer scripts nor the db table
inserts are going to alter data they are given on the grounds that it's
bad utf8.  At least this particular instance is easy to fix for any
zhwiki editor.

Ariel

On Sat, 05-01-2013 at 19:58 +0100, Mathieu Poumeyrol wrote:
 All,
 
 I've been struggling to track this down for a few hours. This file is a SQL dump, 
 and the header says it is UTF-8.
 
 http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
 
 but:
 
 $ isutf8 zh-langlinks.sql 
 zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
 
 $ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | 
 hexdump -C
 00000000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
 00000010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
 00000020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
 00000030  61 6c                                             |al|
 00000032
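
 For readers without isutf8, the same check can be approximated in Python.
 This is only a sketch, not something used in this thread: the function name
 find_invalid_utf8 and the error-handler name are made up for illustration.
 It records the byte offset of each undecodable sequence in one pass. Note
 that 0xF6, the byte flagged in the hexdump above, is "ö" in ISO-8859-1,
 which is consistent with a Latin-1 "Coördinatie" having been stored as-is.

```python
import codecs


def find_invalid_utf8(data: bytes):
    """Return [(byte_offset, bad_bytes), ...] for each invalid UTF-8 sequence."""
    bad = []

    def record(exc):
        # Only decoding errors are expected here; re-raise anything else.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad.append((exc.start, exc.object[exc.start:exc.end]))
        # Substitute U+FFFD and resume decoding after the bad sequence.
        return ("\ufffd", exc.end)

    # Hypothetical handler name; registering again simply overrides it.
    codecs.register_error("record_bad_utf8", record)
    data.decode("utf-8", errors="record_bad_utf8")
    return bad


if __name__ == "__main__":
    # Same bytes as the start of the hexdump: "dia:Co" + 0xF6 + "rdinatie".
    sample = b"dia:Co\xf6rdinatie externe"
    for offset, raw in find_invalid_utf8(sample):
        print(f"byte offset {offset}: {raw!r}")
```

 Offsets are relative to the start of the bytes passed in, so they will only
 match a whole-file offset if the whole file is read.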
 
 There might be other occurrences, but one is enough to make my import scripts 
 crash, so... you guys are warned.
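
 Since the dumps pass such bytes through unchanged, an import script may
 prefer to tolerate them rather than crash. A minimal sketch, assuming Python
 and assuming that replacing each bad byte with U+FFFD is acceptable for the
 import; read_lines_tolerant is a hypothetical helper, not part of any dump
 tooling.

```python
import io


def read_lines_tolerant(stream):
    """Yield text lines from a binary stream, replacing invalid UTF-8 bytes.

    Invalid sequences become U+FFFD instead of raising UnicodeDecodeError.
    Lines are split on "\n" only, and line endings are left untranslated.
    """
    text = io.TextIOWrapper(
        stream, encoding="utf-8", errors="replace", newline="\n"
    )
    for line in text:
        yield line


if __name__ == "__main__":
    demo = io.BytesIO(b"Co\xf6rdinatie\nok\n")
    for line in read_lines_tolerant(demo):
        print(repr(line))
```

 For the gzipped dump itself, the binary stream could come from
 gzip.open(path, "rb") in the standard library.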
 



___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Encoding issue in the last ZH dump

2013-01-08 Thread Federico Leva (Nemo)

Ariel T. Glenn, 08/01/2013 09:26:

The issue is that the bad character was added in 2004, see

https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385


I've requested removal and revdeletion: 
https://zh.wikipedia.org/w/index.php?diff=24435408&oldid=691618
Mathieu, please follow the discussion in case they ask questions on UTF8 
or on why this is important for you, neither of which I'd be able to 
answer...

Thanks,
Nemo
