Re: [Xmldatadumps-l] Encoding issue in the last ZH dump
Ariel T. Glenn, 08/01/2013 09:26:
> The issue is that the bad character was added in 2004, see
> https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385

I've requested removal and revdeletion:
https://zh.wikipedia.org/w/index.php?diff=24435408&oldid=691618

Mathieu, please follow the discussion in case they ask questions on UTF-8 or on why this is important for you, neither of which I'd be able to answer...

Thanks,
Nemo

___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Re: [Xmldatadumps-l] Encoding issue in the last ZH dump
The issue is that the bad character was added in 2004, see
https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385
before there were aggressive checks for that sort of thing. Garbage in, garbage out... Neither the dump producer scripts nor the db table inserts are going to alter the data they are given on the grounds that it's bad UTF-8. At least this particular instance is easy for any zhwiki editor to fix.

Ariel

On Sat, 05-01-2013, at 19:58 +0100, Mathieu Poumeyrol wrote:
> All,
>
> I've been struggling to track this down for a few hours. This file is a SQL dump,
> and its header says it is UTF-8.
>
> http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
>
> but:
>
> $ isutf8 zh-langlinks.sql
> zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
>
> $ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
> 0000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
> 0010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
> 0020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
> 0030  61 6c                                             |al|
> 0032
>
> There might be other occurrences, but one is enough to make my import scripts
> crash, so... you guys are warned.
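Since the dump passes the stored bytes through unmodified, an import script has to decide for itself what to do with a stray byte such as the 0xf6 above. The following is only a minimal sketch in Python, not anything the dump tooling provides: the function names and the `latin1_fallback` handler name are made up for illustration, and reading the stray byte as ISO-8859-1 (where 0xf6 is "ö") is a guess about the original editor's encoding, not something the dump states.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: reinterpret undecodable bytes as ISO-8859-1.

    ASSUMPTION: the stray bytes were typed in Latin-1. That is a guess and
    can silently produce wrong text for other legacy encodings.
    """
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("latin-1"), err.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_replace(raw: bytes) -> str:
    """Conservative option: mark each invalid sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def decode_guess_latin1(raw: bytes) -> str:
    """Lossless but speculative option: keep valid UTF-8, guess the rest."""
    return raw.decode("utf-8", errors="latin1_fallback")
```

With the bytes from the hexdump, `decode_replace(b"dia:Co\xf6rdinatie")` gives `"dia:Co\ufffdrdinatie"`, while `decode_guess_latin1` gives `"dia:Coördinatie"`. Either way the import no longer crashes. Which trade-off is acceptable depends on whether the importer must preserve the original bytes.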
[Xmldatadumps-l] Encoding issue in the last ZH dump
All,

I've been struggling to track this down for a few hours. This file is a SQL dump, and its header says it is UTF-8.

http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz

but:

$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code

$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
0000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
0010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
0020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
0030  61 6c                                             |al|
0032

There might be other occurrences, but one is enough to make my import scripts crash, so... you guys are warned.

--
K.
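To check for the other occurrences mentioned above without repeating the head/tail/hexdump pipeline for each one, the whole file can be scanned in a single pass. This is a rough sketch in Python rather than a replacement for `isutf8`. The function names are invented for illustration, and the offsets it reports are 0-based within each line, which is an assumption of this sketch and may not match how `isutf8` counts.

```python
from typing import Iterable, Iterator, List, Tuple


def invalid_utf8_offsets(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for each place where UTF-8 decoding fails."""
    hits = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            bad = pos + err.start
            hits.append((bad, data[bad]))
            pos += err.end  # resume after the rejected bytes
    return hits


def scan_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, int, int]]:
    """Yield (1-based line number, 0-based offset in line, byte value)."""
    for lineno, line in enumerate(lines, start=1):
        for offset, value in invalid_utf8_offsets(line):
            yield lineno, offset, value


def report(path: str) -> None:
    """Print every invalid byte found in the file at `path`."""
    with open(path, "rb") as fh:  # binary mode: nothing is decoded on read
        for lineno, offset, value in scan_lines(fh):
            print(f"line {lineno}, byte offset {offset}: 0x{value:02x}")
```

Calling `report("zh-langlinks.sql")` would list every hit in the file. On the 50-byte excerpt alone, `invalid_utf8_offsets` flags offset 6 with value 0xf6, the byte that `hexdump -C` shows as `.` in `Co.rdinatie`.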