2017-07-25 0:35 GMT+02:00 Doug Ewell via Unicode <[email protected]>:
> J Decker wrote:
>
>> I generally accepted any utf-8 encoding up to 31 bits though (since
>> I was going from the original spec, and not what was effective limit
>> based on unicode codepoint space)
>
> Hey, everybody: Don't do that.
>
> UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
> four bytes) for almost fourteen years now.

I fully agree. This constraint is now an essential part of UTF-8. It has helped secure it (notably against the dangerous unbounded loops that scan through buffers in memory), and it has also helped performance: when you unroll loops so that you no longer need a separate length counter, the code expansion stays small enough that branch prediction remains accurate and the code still benefits from the instruction cache. Because of the way the UCS code space is allocated and used, the branches in your code follow very distinctive patterns that are easy to enumerate, so test coverage of those branches is possible without a combinatorial explosion, and heuristics become unnecessary. (A minimal decoder showing those branch classes is sketched at the end of this message.)

As for the old RFC we were discussing, it long predates the approved stabilization of UTF-8 and its eventual endorsement by the industry. UTF-8 is strictly bound to four bytes and nothing more. This lets other software be built on top of that fact, used today as a checked assumption that cannot be broken except by software bugs; such bugs quickly become security problems once a checked assumption stops being checked somewhere along a processing chain.

The old RFC was not "UTF-8" (even if that name was proposed, it was never really assigned) but an early proposal under discussion that never reached the level of a standard or best practice. It was experimental, and at the time there were several other candidates (including UTF-7, which is now almost abandoned, and BOCU-1, which is now marginal but was likewise bound to the 17-plane limit). The encoding in the old RFC should simply be given another name: it was not restricted to encoding text, and in fact described a generic binary format for variable-length encoding of numbers. For that purpose there are now better candidates, which are not limited to 31 bits or to unsigned integers, are faster to process and more compact, and have better properties for code analysis and for resistance to encoding and transmission/storage errors (see the varint sketch below for one example of the genre).

In the IANA charset registry, the old RFC's encoding has a separate identifier, and "UTF-8" refers to RFC 3629 (IETF STD 63); the earlier proposals in RFC 2279 and RFC 2044 never became full Internet Standards and are mapped in IANA only as the obsolete "UNICODE-1-1-UTF-8" (retired later, as it was never approved by Unicode).

The only remaining "charset" in the IANA registry that still refers to 31-bit code points is "ISO-10646-UCS-4", but it uses no variable-length encoding and specifies no byte order; it is just a basic subtype for a range of positive integers, with no restriction of use and not necessarily representing text. It is a very inefficient way to encode them, meant only as an internal, temporary transform in transient memory or CPU registers (at least on 32-bit CPUs or wider, which is almost always the case today even in embedded systems, since 4-, 8- and 16-bit CPUs are nearly dead or will not be used for international text processing; even the simplest keyboard controllers, managing ~100-150 keys and a few LEDs and reporting at 1 kHz for the fastest, now use 32-bit CPUs internally).
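Since the branch enumeration above may sound abstract, here is a minimal sketch (mine, nothing normative) of a bounded, validating UTF-8 decoder in C. The four length classes of RFC 3629 give exactly four lead-byte branches; the continuation reads are checked against an explicit end pointer, and the overlong, surrogate and above-U+10FFFF rejections are plain range tests:

    #include <stdint.h>

    /* Decode one code point from [*s, end); returns the code point and
     * advances *s, or returns -1 on any invalid input. Every read is
     * bounded by 'end', so the caller's loop can never overrun. */
    static int32_t utf8_next(const uint8_t **s, const uint8_t *end)
    {
        const uint8_t *p = *s;
        int32_t cp;
        int n;                                       /* continuation bytes */

        if (p >= end) return -1;
        uint8_t b0 = *p++;

        if (b0 < 0x80) { *s = p; return b0; }        /* branch 1: ASCII    */
        else if (b0 < 0xC2) return -1;               /* stray continuation
                                                        or overlong lead   */
        else if (b0 < 0xE0) { cp = b0 & 0x1F; n = 1; } /* branch 2: 2 bytes */
        else if (b0 < 0xF0) { cp = b0 & 0x0F; n = 2; } /* branch 3: 3 bytes */
        else if (b0 < 0xF5) { cp = b0 & 0x07; n = 3; } /* branch 4: 4 bytes */
        else return -1;                              /* 0xF5..0xFF: would
                                                        exceed U+10FFFF    */

        if (end - p < n) return -1;                  /* explicit bound check */
        for (int i = 0; i < n; i++) {
            if ((p[i] & 0xC0) != 0x80) return -1;    /* not a continuation  */
            cp = (cp << 6) | (p[i] & 0x3F);
        }

        static const int32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[n]) return -1;               /* overlong form       */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* UTF-16 surrogate    */
        if (cp > 0x10FFFF) return -1;                /* above Unicode limit */

        *s = p + n;
        return cp;
    }

Note that a scanning loop built on this needs no separate length counter, which is exactly the point about unrolling: while (p < end) { if (utf8_next(&p, end) < 0) { /* reject */ } }.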
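For reference, here is how the length classes line up; the forms above four bytes existed only in the old RFC and are exactly the ones a conforming decoder must now reject:

    bytes  lead byte    code points          status
    -----  -----------  -------------------  -------------------------
    1      0x00..0x7F   U+0000..U+007F       valid (ASCII)
    2      0xC2..0xDF   U+0080..U+07FF       valid
    3      0xE0..0xEF   U+0800..U+FFFF       valid, minus surrogates
    4      0xF0..0xF4   U+10000..U+10FFFF    valid
    5      0xF8..0xFB   up to 2^26 - 1       old RFC only, now invalid
    6      0xFC..0xFD   up to 2^31 - 1       old RFC only, now invalid

(The old RFC also allowed lead bytes 0xF5..0xF7 in four-byte forms, up to 2^21 - 1; those too are now invalid, since they would encode beyond U+10FFFF.)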
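The "better candidates" for generic variable-length integer encoding are not named above, but the LEB128/varint family (used by DWARF, WebAssembly and Protocol Buffers) is one widely deployed example of the genre, even if it does not have every property claimed; a sketch in the same C style, purely as an illustration and not necessarily the scheme intended:

    #include <stdint.h>
    #include <stddef.h>

    /* Unsigned LEB128: 7 payload bits per byte, high bit set while more
     * bytes follow. Covers the full 64-bit range in at most 10 bytes, so
     * it is not capped at 31 bits the way the old RFC's long forms were. */
    static size_t uleb128_encode(uint64_t v, uint8_t out[10])
    {
        size_t i = 0;
        do {
            uint8_t b = v & 0x7F;
            v >>= 7;
            if (v) b |= 0x80;             /* continuation flag */
            out[i++] = b;
        } while (v);
        return i;                         /* bytes written, 1..10 */
    }

    /* Decode from [p, end); returns bytes consumed, or 0 on truncated or
     * over-long input. The loop is explicitly bounded, as above. */
    static size_t uleb128_decode(const uint8_t *p, const uint8_t *end,
                                 uint64_t *v)
    {
        uint64_t r = 0;
        for (size_t i = 0; i < 10 && p + i < end; i++) {
            r |= (uint64_t)(p[i] & 0x7F) << (7 * i);
            if (!(p[i] & 0x80)) { *v = r; return i + 1; }
        }
        return 0;
    }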

