[I'm CC-ing the unicode list again because I'm doing some fairly
sophisticated interpretation of the Unicode conformance requirements
below and I'd like to have someone with more experience with this
check my reasoning.]

On Wed, 27 Jun 2001, [EMAIL PROTECTED] wrote:

>> This is wrong. It is a bug to encode a non-BMP character with six
>> bytes by pretending that the (surrogate) values used in the UTF-16
>> representation are BMP characters and encoding the character as
>> though it was a string consisting of that character. It is also a
>> bug to interpret such a six-byte sequence as a single character.
>> This was clarified in Unicode 3.1.
>
> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1.

See the section called "UTF-8 Corrigendum" in TR 27. It explains it
all in detail.

> Where exactly does it say that processing a six-byte two-surrogates
> sequence as a single character is non-conforming?

See D39(c) at <http://www.unicode.org/unicode/reports/tr27>. This
defines such a six-byte sequence as an "irregular UTF-8 code unit
sequence" and goes on to state that, as a consequence of C12,
conforming processes are not allowed to generate such sequences. This
really ought to be obvious anyway: UTF-8 is defined to represent a
given USV with 1 to 4 bytes, so clearly 6 is not possible.

Conversely, C12(a) states that a conformant process cannot produce
"ill-formed code unit sequences" while producing data in a UTF. The
definition of this term is given in D30 as a code unit sequence that
cannot be produced from a sequence of Unicode scalar values.

This is where things get somewhat more interesting. Somewhat
surprisingly, the definition of "Unicode Scalar Value" has not been
changed from 3.0 to 3.1. The reason why one might expect this to have
changed is that in 3.0 UTF-16 was "the" Unicode format, so that USVs
were defined in terms of UTF-16 code points.
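To make the six-byte irregular form concrete, here is a short sketch
(Python 3 is used purely for illustration, and postdates this mail;
the surrogate-pair formula is the one given by the standard):

```python
# Non-BMP character U+10000 corresponds to the UTF-16 surrogate pair
# D800 DC00. The standard formula recovers the scalar value:
high, low = 0xD800, 0xDC00
usv = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
assert usv == 0x10000

# Correct UTF-8 encodes the scalar value directly, in four bytes:
assert chr(usv).encode("utf-8") == b"\xf0\x90\x80\x80"

# The irregular form instead encodes each surrogate as though it were
# a BMP character, yielding two three-byte sequences (six bytes):
irregular = b"\xed\xa0\x80\xed\xb0\x80"

# A conformant decoder must not interpret this as U+10000; Python's
# utf-8 codec duly rejects it:
try:
    irregular.decode("utf-8")
except UnicodeDecodeError:
    print("irregular sequence rejected")
```
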
In 3.1 it is stated elsewhere that the different UTFs are simply
concrete ways to store sequences of USVs. However, the definition of
USV is still either: a value in the range 0 - 0xFFFF which is not a
high or low surrogate in UTF-16, or: a value in the range 0x10000 -
0x10FFFF which is obtained by taking a pair of values that form a high
and low surrogate respectively in UTF-16 and applying the usual
formula.

Since there is no way you can form a value in the range 0xD800 -
0xDFFF in this fashion, it follows that a USV cannot be in this range.
Therefore you are not allowed to create a three-byte sequence that is
the UTF-8 encoding of a value in this range, and therefore you are not
allowed to generate pairs of such sequences either. I hope this is all
clear.

One very important thing to keep in mind when doing this stuff is
that 3.1 is a brand new standard, less than one and a half months old.
A consequence of this is that most of the material on the Unicode web
site still refers to version 3.0, so you have to be very careful to
check that the information you're looking at is in fact up to date.
(The only updated information I could find was TR 27 and [probably]
the data tables.)

> What exactly does it say that the conforming behaviour should be?

Argh. Treat it as an error, probably. You go and read the standard
yourself, my head is already hurting. 8-)

>> Personally, I think that the codecs should report an error in the
>> appropriate fashion when presented with a python unicode string
>> which contains values that are not allowed, such as lone
>> surrogates.
>
> Other people have read Unicode 3.1 and came to the conclusion that
> it mandates that implementations accept such a character...

Well, they're wrong. The standard is clear as ink in this regard.

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
I can't think about that.  It doesn't go with HEDGES in the shape of
LITTLE LULU -- or ROBOTS making BRICKS...