Philippe continued:

> As if Unicode had to be bound on architectural constraints such as the
> requirement of representing code units (which are architectural for a
> system) only as 16-bit or 32-bit units,
Yes, it does. By definition. In the standard.

> ignoring the fact that technologies do evolve and will not necessarily
> keep this constraint. 64-bit systems already exist today, and even if
> they have, for now, the architectural capability of handling efficiently
> 16-bit and 32-bit code units so that they can be addressed individually,
> this will possibly not be the case in the future.

This is just as irrelevant as worrying about the fact that 8-bit character encodings may not be handled efficiently by some 32-bit processors.

> When I look at the encoding forms such as UTF-16 and UTF-32, they just
> define the value ranges in which code units will be valid, but not
> necessarily their size.

Philippe, you are wrong. Go reread the standard. Each of the encoding forms is *explicitly* defined in terms of code unit size in bits:

"The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form."

If there is something ambiguous or unclear in wording such as that, I think the UTC would like to know about it.

> You are mixing this with encoding schemes, which is what is needed for
> interoperability, and where other factors such as bit or byte ordering
> are also important in addition to the value range.

I am not mixing it up -- you are, unfortunately. And it is most unhelpful on this list to have people waxing on, with apparently authoritative statements about the architecture of the Unicode Standard, which on examination turn out to be flat wrong.

> I won't see anything wrong if a system is set so that UTF-32 code units
> will be stored in 24-bit or even 64-bit memory cells, as long as they
> respect and fully represent the value range defined in encoding forms,

Correct. And I said as much. There is nothing wrong with implementing UTF-32 on a 64-bit processor. Putting a UTF-32 code point into a 64-bit register is fine.
What you have to watch out for is handing me a 64-bit array of ints and claiming that it is a UTF-32 sequence of code points -- it isn't.

> and if the system also provides an interface to convert them with
> encoding schemes to interoperable streams of 8-bit bytes.

No, you have to have an interface which hands me the correct data type when I declare it uint_32, and which gives me correct offsets in memory if I walk an index pointer down an array. That applies to the encoding *form*, and is completely separate from provision of any streaming interface that wants to feed data back and forth in terms of byte streams.

> Are you saying that UTF-32 code units need to be able to represent any
> 32-bit value, even if the valid range is limited, for now, to the first
> 17 planes?

Yes.

> An API on a 64-bit system that would say that it requires strings being
> stored with UTF-32 would also define how UTF-32 code units are
> represented. As long as the valid range 0 to 0x10FFFF can be
> represented, this interface will be fine.

No, it will not. Read the standard. An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32 is fine. It isn't fine if it uses an unsigned 64-bit datatype for UTF-32.

> If this system is designed so that two or three code units will be
> stored in a single 64-bit memory cell, no violation will occur in the
> valid range.

You can do whatever the heck crazy thing you want to do internal to your data manipulation, but you cannot surface a datatype packed that way and conformantly claim that it is UTF-32.

> More interestingly, there already exist systems where memory is
> addressable by units of 1 bit, and on these systems, ...
[excised some vamping on the future of computers]

> Nothing there is impossible for the future (when it will become more and
> more difficult to increase the density of transistors, or to reduce
> further the voltage, or to increase the working frequency, or to avoid
> the inevitable and random presence of natural defects in substrates;
> escaping from the historic binary-only systems may offer interesting
> opportunities for further performance increase).

Look, I don't care if the processors are dealing in qubits on molecular arrays under the covers. It is the job of the hardware folks to surface appropriate machine instructions that compiler makers can use to surface appropriate formal language constructs to programmers to enable hooking the defined datatypes of the character encoding standards into programming language datatypes.

It is the job of the Unicode Consortium to define the encoding forms for representing Unicode code points, so that people manipulating Unicode digital text representation can do so reliably using general-purpose programming languages with well-defined textual data constructs. I believe it has done so. No amount of blueskying about the future of optical or quantum computing actually changes that situation one bit. ;-)

--Ken