https://bugzilla.wikimedia.org/show_bug.cgi?id=22137

--- Comment #6 from Platonides <platoni...@gmail.com> 2010-02-12 22:58:48 UTC ---
Java internally uses UTF-16:

"The native coded character set of the Java programming language is that of the
first seventeen planes of the Unicode version 3.0 character set; that is, it
consists in the basic multilingual plane (BMP) of Unicode version 1 plus the
next sixteen planes of Unicode version 3. This is because the language's
internal representation of characters uses the UTF-16 encoding, which encodes
the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode
the other planes. Hence a charset in the Java platform defines a mapping
between sequences of sixteen-bit values in UTF-16 and sequences of bytes."
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html
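
As a rough illustration of the surrogate-pair mechanism described in the quote, here is a minimal sketch that splits U+1D59F (the character in question) into its two UTF-16 code units. The class and method names (SurrogatePairDemo, toSurrogates) are made up for this example; only the java.lang.Character and String calls are standard API.

```java
public class SurrogatePairDemo {
    // Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D59F);
        System.out.printf("U+1D59F -> %04X %04X%n", (int) pair[0], (int) pair[1]);

        // Cross-check against the standard library.
        String s = new String(Character.toChars(0x1D59F));
        System.out.println(s.length());                      // UTF-16 code units: 2
        System.out.println(s.codePointCount(0, s.length())); // code points: 1
    }
}
```

The hand-written split should agree with Character.toChars: the string holds two chars but only one code point.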

The file contains U+1D59F encoded in UTF-8, that is, the bytes F0 9D 96 9F
(in binary: 11110000 10011101 10010110 10011111).
I don't see why it is being read as U+0026 (binary 100110).
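
To double-check the byte arithmetic above, here is a small sketch that decodes F0 9D 96 9F both by hand and through Java's own UTF-8 decoder. The class and method names (Utf8DecodeDemo, decodeFourByte) are invented for illustration, and the manual decoder assumes a well-formed 4-byte sequence with no validation.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DecodeDemo {
    // Decode one 4-byte UTF-8 sequence by hand:
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (assumes valid input, no checks).
    static int decodeFourByte(byte[] b) {
        return ((b[0] & 0x07) << 18)
             | ((b[1] & 0x3F) << 12)
             | ((b[2] & 0x3F) << 6)
             |  (b[3] & 0x3F);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F };

        // Payload bits: 000 011101 010110 011111 = 0x1D59F
        System.out.printf("manual:  U+%X%n", decodeFourByte(bytes));

        // Same bytes through the standard decoder.
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.printf("library: U+%X%n", s.codePointAt(0));
    }
}
```

Both lines should print U+1D59F, so a reader that reports U+0026 for these bytes is not decoding them as a single 4-byte UTF-8 sequence.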


PS: Maybe Bugzilla is using MySQL with utf8 columns instead of binary? MySQL's
utf8 character set currently only supports the BMP.
