Re: [Wikitech-l] XML and Unicode chars in tag names

Strainu Sun, 14 Jul 2013 15:09:59 -0700

2013/7/15 Bjoern Hoehrmann <derhoe...@gmx.net>:
>
> I18N issues are far easier to debug with access to the actual bytes that
> demonstrate the problem. Copying and pasting text into an email adds and
> obscures potential problems. You should also always give the exact error
> messages you are receiving and not your interpretation of them.


Hi Bjoern,

Thanks for your extensive answer.  I will keep that in mind. The
actual url used for testing is below.

>
> I am guessing your file is not actually UTF-8 encoded.

This doesn't seem to be the case:

> wget "http://despresate.strainu.ro/judet.php?id=15&f=xml&t=all&commune=all" -O 1.xml
2013-07-15 00:37:58 (178 KB/s) - `1.xml' saved [31081]

> enca -L none 1.xml
Universal transformation format 8 bits; UTF-8

> file -bi 1.xml
application/xml; charset=utf-8
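
The same check can be done at the byte level without relying on the
heuristics in enca or file. This is only a minimal sketch: the helper
name is_valid_utf8 and both sample byte strings are made up for
illustration, they are not taken from 1.xml.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: U+0163 encoded as UTF-8 (two bytes, C5 A3) ...
utf8_sample = "<\u0163/>".encode("utf-8")
# ... and a single 0xFE byte, as a legacy 8-bit encoding might produce.
# 0xFE can never appear in well-formed UTF-8.
legacy_sample = b"<\xfe/>"

print(is_valid_utf8(utf8_sample))
print(is_valid_utf8(legacy_sample))
```

For a real file one would read it in binary mode, open("1.xml", "rb"),
and pass the bytes to the helper.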

>
> I then made a minimal test case, `<` followed by U+0163 and `/>` making
> sure the document is UTF-8 encoded and loaded that in a browser that I
> know checks for illegal characters in names.
>
>   data:application/xml,%3c%c5%a3%2f%3e
>
> That worked fine so your problem description is incorrect or incomplete.
> I would recommend having the `xmllint` frontend to libxml2 around and do
> `xmllint example.xml`. That, too, works fine for my test case.

xmllint works for me too (for 1.xml). Still, Firefox insists there is
a problem in the XML file, while Chromium accepts the same file.
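
For reference, the minimal test case quoted above can be reproduced
outside a browser. The sketch below percent-decodes the payload of the
data: URL into raw bytes and hands them to Python's standard
xml.etree.ElementTree parser. It assumes Python 3; the variable names
are mine, and it only covers U+0163, not the other characters in my
file.

```python
from urllib.parse import unquote_to_bytes
import xml.etree.ElementTree as ET

# Payload of the data: URL from the quoted test case.
payload = "%3c%c5%a3%2f%3e"

# Percent-decoding gives the actual bytes: '<', C5 A3 (U+0163 in UTF-8), '/>'.
raw = unquote_to_bytes(payload)

# With no XML declaration the parser assumes UTF-8 for a byte string.
root = ET.fromstring(raw)

print(raw)
print(ascii(root.tag))
```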

>
> I take it from your later mail that you are getting `UnicodeEncodeError`
> in Python. You asked Python to encode U+0219 using the `ascii` codec and
> Python is telling you that U+0219 cannot be encoded using that codec.
> You have to check what kind of string `fromstring` expects (byte string
> or character string or what) and then check how to create such a string
> in Python from a literal in the source code. You might need a u'' string
> and call .encode('utf-8') on it.

Correct, the string I passed was simply not UTF-8, my mistake. Reading
directly from the file (including the HTTP URL) works here too.
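
To spell out the difference, here is a minimal sketch, assuming
Python 3 and the standard xml.etree.ElementTree module (not necessarily
the library I was using). The element name "name" is hypothetical.
Encoding U+0219 with the ascii codec fails, while encoding the whole
document as UTF-8 before parsing works:

```python
import xml.etree.ElementTree as ET

# Character string containing U+0219; the element name is hypothetical.
document = "<name>\u0219</name>"

# The ascii codec has no representation for U+0219.
try:
    document.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Explicitly encode to UTF-8 bytes; U+0219 becomes the two bytes C8 99.
utf8_bytes = document.encode("utf-8")
root = ET.fromstring(utf8_bytes)

print(ascii_ok)
print(ascii(root.text))
```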

Still, it seems to me that support for Unicode characters in tag names
is sketchy.  Would you recommend that I go ahead with those names, or
would it be wiser, for the sake of reusers, to keep to ASCII letters?

Thanks all for your help,
   Strainu

> --
> Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

