On Sun, 10 May 2015 21:19:52 +0200 Philippe Verdy <[email protected]> wrote:
> The way I read D77 (code unit), it is not bound to any Unicode
> encoding form;

Agreed.

> "The minimal bit combination that can represent a unit of encoded
> text for processing or interchange" can be any bit length and can
> even use a non-binary representation (not bit-based; it could be
> ternary, or floating point, or base ten, with the remaining bit
> patterns possibly used for other functions (such as clock
> synchronization/calibration, or polarization balancing), leaving
> only some patterns distinguishable, and not necessarily an exact
> power of two...)

I don't object to that reading, but I'm not sure it's correct.

> I don't see why a 32-bit code unit or 8-bit code unit has to be
> bound to UTF-32 or UTF-8 in D77; the code unit is just a code unit;
> it does not have to be assigned any Unicode scalar value or exist in
> a specific pattern valid for UTF-32 or UTF-8 (in addition, these two
> UTFs are not the only ones supported; look at SCSU for example, or
> GB18030, which are also conforming UTFs):

D77 is definitely not bound to Unicode encoding forms - it gives
Shift-JIS as an example of an encoding that has code units.

> The code unit is just one element within an enumerable and finite
> set of elements that is transmissible to some interface and
> interchangeable.
>
> It's up to each UTF to define how they can use them: these UTFs are
> usable on these sets provided that the sets are large enough to
> contain at least the number of code units required for this UTF to
> be supported (which means that the actual bit count of the
> transported code units does not matter; this is out of scope of TUS,
> which just requires sets with sufficient cardinality):

The critical matter is the number of array elements needed for each
scalar value and the pattern of which elements of the scalar values
have the 'same' values.

> For these reasons I absolutely do not see why you argue that
> 0xFFFFFFFF cannot be a valid 32-bit code unit

Fair point so far.
I agree it can be a 32-bit code unit in some character encoding.
However, it is not a UTF-32 code unit.

> and then why <0xFFFFFFFF> can't be a valid 32-bit string

I agree that it is a 32-bit string. I don't know what you mean by the
word 'valid' in this context.

> (or "Unicode 32-bit string" like TUS renames it in D80-D83 in a way
> that is really unproductive (and in fact confusing).

I hope you now see that it cannot be a Unicode 32-bit string, for
0xFFFFFFFF is not a UTF-32 code unit. This is a key point in the
difference between:

a) x-bit string,
b) Unicode x-bit string, and
c) UTF-x string

For x=8, these are three different things. For x=16 or x=32, these
are two different things, but they do not split the same way.

D80-D83 do not directly rename 8-bit strings, 16-bit strings or
32-bit strings as Unicode 8-bit strings, Unicode 16-bit strings or
Unicode 32-bit strings. That all 16-bit strings are Unicode 16-bit
strings is a consequence of the definition of UTF-16. Similarly, that
not all 8-bit strings are Unicode 8-bit strings and that not all
32-bit strings are Unicode 32-bit strings are consequences of the
definitions of UTF-8 and UTF-32 respectively.

I agree that the concept of Unicode 8-bit strings is not useful. The
separate concept of Unicode 32-bit strings is also not useful, for I
contend that all Unicode 32-bit strings are in fact UTF-32 strings.
The latter result is an immediate consequence of UTF-32 not being a
multi-code-unit encoding.

> As well, nothing prohibits supporting the UTF-32 encoding form over
> a 21-bit stream, using another "encoding scheme" (which cannot also
> be named UTF-32, UTF-32BE or UTF-32LE, but could be named
> "UTF-32-21"): the result will be a 21-bit string; but the 21-bit
> code unit 0x1FFFFF will still be valid.
>
> 2015-05-10 12:23 GMT+02:00 Richard Wordingham <
> [email protected]>:
>
> > On Sun, 10 May 2015 07:42:14 +0200
> > Philippe Verdy <[email protected]> wrote:
> >
> > I am replying out of order for greater coherence of my reply.
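The three-way split above can be sketched in code. This is a minimal
illustration of my argument, not anything from the standard's text; it
assumes the scalar-value range of D76 (0x0000..0x10FFFF excluding the
surrogate range 0xD800..0xDFFF) and reads "Unicode x-bit string"
(D80-D83) as "every code unit could occur in some well-formed UTF-x
sequence":

```python
def is_scalar_value(u):
    # D76: a Unicode scalar value is any code point except surrogates.
    return 0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF)

def is_utf32_code_unit(u):
    # A 32-bit unit can appear in well-formed UTF-32 only if it is a
    # scalar value; in particular 0xFFFFFFFF is excluded.
    return is_scalar_value(u)

def is_unicode_16bit_string(units):
    # Every value 0x0000..0xFFFF can occur in well-formed UTF-16
    # (surrogate values occur there in pairs), so every 16-bit string
    # is a Unicode 16-bit string; well-formedness is stricter.
    return all(0 <= u <= 0xFFFF for u in units)

def is_unicode_32bit_string(units):
    # Because UTF-32 uses exactly one code unit per scalar value, a
    # Unicode 32-bit string is automatically a UTF-32 string as well.
    return all(is_utf32_code_unit(u) for u in units)
```

Note that a lone surrogate such as 0xD800 passes the 16-bit test but
fails the 32-bit one, which is why the splits differ for x=16 and
x=32.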
> > > However I wonder what would be the effect of D80 in UTF-32: is
> > > <0xFFFFFFFF> a valid "32-bit string"? After all, it also
> > > contains a single 32-bit code unit (for at least one Unicode
> > > encoding form), even if it has no "scalar value" and then does
> > > not have to validate D89 (for UTF-32)...
> >
> > The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore
> > it cannot represent a unit of encoded text in a UTF-32 string. By
> > D77 paragraph 1, "Code unit: The minimal bit combination that can
> > represent a unit of encoded text for processing or interchange",
> > it is therefore not a code unit.

Correction: "is therefore not a UTF-32 code unit."

> > The effect of D77, D80 and D83 is that <0xFFFFFFFF> is a 32-bit
> > string but not a Unicode 32-bit string.
> >
> > > - D80 defines "Unicode string" but in fact it just defines a
> > > generic "string" as an arbitrary stream of fixed-size code
> > > units.
> >
> > No - see argument above.
> >
> > > These two rules [D80 and D82 - RW] are not productive at all,
> > > except for saying that all values of fixed-size code units are
> > > acceptable (including for example 0xFF in 8-bit strings, which
> > > is invalid in UTF-8)

I ask again: Do you still maintain this reading of D77? D77 is not as
clear as it should be.

Richard.
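P.S. The 0xFF point can be made concrete. A rough sketch, on the
assumption (drawn from the UTF-8 definition, not from this thread)
that the bytes which can never occur in well-formed UTF-8 are exactly
0xC0, 0xC1 and 0xF5..0xFF:

```python
def can_occur_in_utf8(byte):
    # 0xC0 and 0xC1 would only start overlong sequences; 0xF5..0xFF
    # would encode values above 0x10FFFF. No well-formed UTF-8
    # sequence contains them.
    return byte not in (0xC0, 0xC1) and byte <= 0xF4

def is_unicode_8bit_string(data):
    # Every byte must be able to occur somewhere in well-formed
    # UTF-8; this is weaker than the whole string being well-formed.
    return all(can_occur_in_utf8(b) for b in data)
```

So b"\xff" is an 8-bit string but not a Unicode 8-bit string, while
b"\xed\xa0\x80" (the bytes that would encode a lone surrogate) is a
Unicode 8-bit string without being a UTF-8 string - which is why the
8-bit case splits three ways.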

