On Jan 10, 2014, at 3:10 PM, John Gilmore <jwgli...@gmail.com> wrote:

> I have, however, found all of the UTF-8 implementations I have used
> both unsatisfactory and unreliable in the literal sense that
> conversions into UTF-8 from UTF-16 using them do not always yield the
> same results.

Is the issue related to surrogate pairs? This is in the FAQ I linked to in my 
previous email:

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 
four-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using 
surrogate pairs in UTF-16) be encoded with a single four byte sequence. 
However, there is a widespread practice of generating pairs of three byte 
sequences in older software, especially software which pre-dates the 
introduction of UTF-16 or that is interoperating with UTF-16 environments under 
particular constraints. Such an encoding is not conformant to UTF-8 as defined. 
See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) for a 
formal description of such a non-UTF-8 data format. When using CESU-8, great 
care must be taken that data is not accidentally treated as if it was UTF-8, 
due to the similarity of the formats. [AF]
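
To make the FAQ answer concrete, here is a minimal Python sketch of the two 
encodings for the pair <D800 DC00>. The helper names 
(utf16_pair_to_code_point, cesu8_encode_pair) are made up for illustration and 
are not part of any library; only str.encode and bytes.decode are standard.

```python
def utf16_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def cesu8_encode_pair(high: int, low: int) -> bytes:
    """Encode each surrogate as its own 3-byte sequence (CESU-8, NOT UTF-8)."""
    out = bytearray()
    for unit in (high, low):
        out += bytes([
            0xE0 | (unit >> 12),          # lead byte of a 3-byte sequence
            0x80 | ((unit >> 6) & 0x3F),  # first continuation byte
            0x80 | (unit & 0x3F),         # second continuation byte
        ])
    return bytes(out)


cp = utf16_pair_to_code_point(0xD800, 0xDC00)   # U+10000
utf8 = chr(cp).encode("utf-8")                  # conformant: one 4-byte sequence
cesu8 = cesu8_encode_pair(0xD800, 0xDC00)       # non-conformant: two 3-byte sequences

print(utf8.hex(" "))    # f0 90 80 80
print(cesu8.hex(" "))   # ed a0 80 ed b0 80

# A strict UTF-8 decoder rejects the CESU-8 bytes.
try:
    cesu8.decode("utf-8")
    cesu8_accepted = True
except UnicodeDecodeError:
    cesu8_accepted = False
print(cesu8_accepted)   # False
```

If a converter emits the six-byte form for some inputs and the four-byte form 
for others, that alone would explain getting different results for the same 
text.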


> 
> If I have one, I suppose that English is my mother tongue; but, unlike
> some of you, my preoccupations are not exclusively or even
> predominantly anglophone.  I am a polyglot.  There is no effective
> appeal from my determination that a passage from Leopardi, say, is
> mangled when it is converted/moved from UTF-16 to UTF-8

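Then whatever converted it for you has a bug, because there is an isomorphic 
relationship between UTF-16 and UTF-8: both encode exactly the same set of 
code points, so a correct conversion of well-formed text is lossless and 
always yields the same result.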
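
A quick way to check a converter is a round-trip test. The following is a 
sketch only, using Python's built-in codecs as a stand-in for whatever 
conversion software is actually in use; the function name utf16_to_utf8 and 
the sample string are assumptions chosen for illustration.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 (little-endian, no BOM) and re-encode as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


# Sample text with accented letters plus one supplementary character (U+10000).
sample = "Così tra questa immensità s'annega il pensier mio \U00010000"

utf16 = sample.encode("utf-16-le")
utf8 = utf16_to_utf8(utf16)

# Going back from UTF-8 to UTF-16 must reproduce the original bytes exactly.
round_trip = utf8.decode("utf-8").encode("utf-16-le")
lossless = round_trip == utf16

# Converting the same input twice must give byte-identical output.
deterministic = utf16_to_utf8(utf16) == utf8

print(lossless, deterministic)
```

A converter that fails either check on well-formed input is the thing to 
report a bug against.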

> 
> I have of course reported these anomalies to the appropriate Unicode bodies.

Perhaps you should report it to whoever created your conversion software.

-- 
Curtis Pew (c....@its.utexas.edu)
ITS Systems Core
The University of Texas at Austin

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
