On Jan 10, 2014, at 3:10 PM, John Gilmore <jwgli...@gmail.com> wrote:
> I have, however, found all of the UTF-8 implementations I have used
> both unsatisfactory and unreliable in the literal sense that
> conversions into UTF-8 from UTF-16 using them do not always yield the
> same results.

Is the issue related to surrogate pairs? This is in the FAQ I linked to in my previous email:

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one four-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four-byte sequence. However, there is a widespread practice of generating pairs of three-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats. [AF]

> If I have one, I suppose that English is my mother tongue; but, unlike
> some of you, my preoccupations are not exclusively or even
> predominantly anglophone. I am a polyglot. There is no effective
> appeal from my determination that a passage from Leopardi, say, is
> mangled when it is converted/moved from UTF-16 to UTF-8

Then whatever converted it for you has a bug, because there is an isomorphic relationship between UTF-16 and UTF-8: every well-formed UTF-16 string corresponds to exactly one UTF-8 string, and the conversion is lossless in both directions.

> I have of course reported these anomalies to the appropriate Unicode bodies.

Perhaps you should report it to whoever created your conversion software.
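To make the surrogate-pair point concrete, here is a minimal Python sketch (my own illustration, not taken from the FAQ or from any particular converter). The helper names `utf16_pair_to_code_point`, `encode_utf8` and `encode_cesu8` are hypothetical; the only library calls are Python's built-in `str.encode` and `bytes.decode`. It shows the same pair <D800 DC00> encoded both ways, which is one plausible way two converters could disagree.

```python
def utf16_pair_to_code_point(high, low):
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_utf8(code_point):
    """Conformant UTF-8, using Python's built-in codec."""
    return chr(code_point).encode("utf-8")


def encode_cesu8(code_point):
    """CESU-8 style (hypothetical helper): each surrogate becomes its own
    3-byte sequence, so a supplementary character takes 6 bytes."""
    if code_point < 0x10000:
        return chr(code_point).encode("utf-8")
    offset = code_point - 0x10000
    surrogates = (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))
    out = bytearray()
    for s in surrogates:
        # Standard 3-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
        out += bytes([0xE0 | (s >> 12),
                      0x80 | ((s >> 6) & 0x3F),
                      0x80 | (s & 0x3F)])
    return bytes(out)


cp = utf16_pair_to_code_point(0xD800, 0xDC00)
print(hex(cp))                   # 0x10000
print(encode_utf8(cp).hex())     # f0908080      (one 4-byte sequence)
print(encode_cesu8(cp).hex())    # eda080edb080  (two 3-byte sequences)

# The conformant form round-trips back to the original UTF-16 pair.
assert encode_utf8(cp).decode("utf-8").encode("utf-16-be") == b"\xd8\x00\xdc\x00"
```

A strict UTF-8 decoder rejects the six-byte form (in Python, `encode_cesu8(cp).decode("utf-8")` raises `UnicodeDecodeError`), so text produced by one kind of converter and read by the other can look mangled even though neither side lost any information.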
-- 
Curtis Pew (c....@its.utexas.edu)
ITS Systems Core
The University of Texas at Austin