2014-07-02 20:19 GMT+02:00 David Starner <prosfil...@gmail.com>:

> I might argue 11111111b for 0x00 in UTF-8 would be technically
> legal
It is not. UTF-8 specifies the effective value of each 8-bit byte: if you store 11111111b in a byte you get exactly the same result as when storing 0xFF or -1 (unless your system uses "bytes" larger than 8 bits, but the era of PDP-style machines whose bytes were not 8 bits wide is long over; every device around exposes 8-bit byte values on its interface, even if it internally encodes those bits with longer sequences, such as MFM encodings, extra control and clock/sync bits, or rotating sequences of three states synchronized by the negative or positive transitions at every encoded bit position, plus some rule-breaking patterns to mark the start of packets).

> the standard never specifies which bit sequences correspond to
> which byte values--but \xC0\x80 would probably be more reliably
> processed by existing code.

But the same C libraries also use -1 as their end-of-stream value (which becomes 0xFF when narrowed to a byte), and they use 0x00 as the string terminator, so a NUL character stored anywhere inside the stream cannot be distinguished from the end of the string.

The main reason why 0xC0,0x80 was chosen instead of 0x00 is historic: Java's JNI interface originally passed strings only as 8-bit sequences, with no separate parameter giving the length of the encoded sequence. 0x00 was therefore used as the terminator, just as in the basic ANSI C string library (string.h and stdio.h), and Java was ported to heterogeneous systems (including small devices whose "int" type was only 8 bits wide, which blocked the use of BOTH 0x00 and 0xFF in some system I/O APIs).

At least 0xC0,0x80 was safe (and not otherwise used by UTF-8; at that time UTF-8 was not yet precisely defined as a standard, and representing U+0000 as 0xC0,0x80 was still legal, since the prohibition of overlong sequences in UTF-8 and Unicode came many years later. Java used the early, informative-only RFC specification, which was also supported by ISO, before ISO/IEC 10646-1 and Unicode 1.1 were aligned). Unicode and ISO/IEC 10646 have both changed since then (each in incompatible ways), but it was necessary to keep the two standards compatible with each other.

Java could not change its JNI ABI; it was too late. However, Java added another, UTF-16-based string interface to JNI. That interface still does not enforce the UTF-16 rules about paired surrogates (just like C, C++ or even JavaScript), but it has a separate field for the encoded string length (in 16-bit code units), so it can use the standard 0x0000 value for U+0000.

As much as possible, JNI extension libraries should use that 16-bit interface (which is also simpler to handle with modern Unicode-compatible OS APIs, notably on Windows). But the 8-bit JNI interface is still commonly used in JNI extension libraries for Unix/Linux (because it is safer to perform the 16-bit to 8-bit conversion inside the JVM than in an external JNI library that does its own memory allocation and cannot use the garbage collector of the JVM's managed memory).

The Java modified UTF-8 encoding is still used in the binary encoding of compiled class files (this is invisible to applications, which only see 16-bit encoded strings, unless they have to parse or generate compiled class files).
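To illustrate the end-of-stream point above, here is a minimal C sketch (the file name "data.bin" is only a placeholder): getc() returns an int precisely so that EOF (-1) stays distinguishable from a real 0xFF data byte, and narrowing the result to a char loses that distinction.

#include <stdio.h>

int main(void) {
    FILE *f = fopen("data.bin", "rb");    /* "data.bin" is a placeholder */
    if (!f) return 1;

    int c;                                /* must be int, not char */
    long nuls = 0, ffs = 0;
    while ((c = getc(f)) != EOF) {
        if (c == 0x00) nuls++;            /* a raw NUL byte in the stream */
        if (c == 0xFF) ffs++;             /* kept distinct from EOF (-1) */
    }
    printf("NUL bytes: %ld, 0xFF bytes: %ld\n", nuls, ffs);

    /* The broken variant, for contrast:
     *     char ch;
     *     while ((ch = getc(f)) != EOF) ...
     * Where char is signed, a 0xFF data byte compares equal to EOF and the
     * loop stops early: the byte value and the end-of-stream marker collide,
     * which is exactly why -1/0xFF cannot double as an in-band character. */
    fclose(f);
    return 0;
}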
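And a minimal sketch of the encoding trick itself (not the JVM's actual code): U+0000 is written as the overlong pair 0xC0,0x80 and unpaired surrogates are passed through as-is, so the encoded bytes stay usable with NUL-terminated C string functions such as strlen().

#include <stdio.h>
#include <string.h>

/* Encode one UTF-16 code unit in Java-style modified UTF-8.  Surrogates
 * are encoded individually (no pairing enforced, matching the behaviour
 * described above).  Returns the number of bytes written. */
static size_t encode_modified_utf8(unsigned int cu, unsigned char *out) {
    if (cu == 0x0000) {                       /* special case: avoid a raw NUL */
        out[0] = 0xC0; out[1] = 0x80; return 2;
    } else if (cu < 0x80) {
        out[0] = (unsigned char)cu; return 1;
    } else if (cu < 0x800) {
        out[0] = 0xC0 | (cu >> 6);
        out[1] = 0x80 | (cu & 0x3F); return 2;
    } else {                                  /* includes unpaired surrogates */
        out[0] = 0xE0 | (cu >> 12);
        out[1] = 0x80 | ((cu >> 6) & 0x3F);
        out[2] = 0x80 | (cu & 0x3F); return 3;
    }
}

int main(void) {
    /* "A", U+0000, "B" -- three UTF-16 code units. */
    unsigned int units[] = { 'A', 0x0000, 'B' };
    unsigned char buf[16]; size_t len = 0;
    for (size_t i = 0; i < 3; i++)
        len += encode_modified_utf8(units[i], buf + len);
    buf[len] = 0;                             /* the real terminator */

    /* strlen() now reports 4 bytes (41 C0 80 42), not 1: the embedded
     * U+0000 no longer truncates the string. */
    printf("encoded length: %zu, strlen: %zu\n", len, strlen((char *)buf));
    return 0;
}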
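Finally, a sketch of the two JNI string accessors as seen from a native C extension library; the class name "Example", the method name "handleString" and the helper process_utf16() are hypothetical, but GetStringChars()/GetStringLength() and GetStringUTFChars() are the actual JNI entry points. The 16-bit accessor returns raw UTF-16 code units with an explicit length, so a real U+0000 passes through unchanged; the 8-bit accessor returns the NUL-terminated modified-UTF-8 form.

#include <jni.h>

void process_utf16(const jchar *units, jsize count);   /* hypothetical callee */

JNIEXPORT void JNICALL
Java_Example_handleString(JNIEnv *env, jobject self, jstring js)
{
    /* Preferred: 16-bit interface with an explicit length field. */
    jsize len = (*env)->GetStringLength(env, js);
    const jchar *u16 = (*env)->GetStringChars(env, js, NULL);
    if (u16 != NULL) {
        process_utf16(u16, len);              /* U+0000 arrives as 0x0000 */
        (*env)->ReleaseStringChars(env, js, u16);
    }

    /* Legacy: 8-bit interface; U+0000 arrives as the byte pair 0xC0,0x80,
     * so the buffer stays usable as an ordinary C string. */
    const char *mutf8 = (*env)->GetStringUTFChars(env, js, NULL);
    if (mutf8 != NULL) {
        /* ... use mutf8 as a NUL-terminated string ... */
        (*env)->ReleaseStringUTFChars(env, js, mutf8);
    }
}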