From: Andrew C. West
> (Unfortunately I've just noticed that BabelPad has a slight
> bug with out of range GB-18030 values such as
> <E3 32 9A 36> = U+110000.)
Could an editor that loads such an incorrect but legacy GB-18030 file agree to load it and work with it using an internal-only UCS-4 mapping (or an extended UTF-8 mapping), so as to preserve those out-of-range sequences as if they were mapped into an extra PUA range? Of course, saving the file in a UTF encoding would be forbidden, but saving the internal UCS-4 buffer back to GB-18030 would preserve those out-of-range GB-18030 sequences, without interpreting them in any other way and without arbitrarily changing them into the GB-18030 equivalent of U+FFFD.

The editor could still use the Unicode rules for all valid GB-18030 sequences, and the invalid characters could then be represented, for example, with a colored or highlighted glyph such as <U+110000>. As neither the input nor the output is a Unicode encoding scheme, I don't think this breaks Unicode conformance: the behavior would simply conform to GB-18030 or to other legacy GB PUA mappings.

Of course, such an editor will not be able to work on this text if its internal encoding form is UTF-16, unless it uses additional internal markup or side storage for the GB sequences that were mapped in the edit buffer to a 0xFFFD UTF-16 code unit. This "augmented text", with an annotated value for each U+FFFD present in the text, would then not be handled as if it were only Unicode plain text; it would constitute what Unicode calls a higher-level protocol, used to keep the original code sequences that come from a non-Unicode charset encoding and have no clear equivalent in Unicode.
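To make the idea concrete, here is a minimal sketch in Python of such an internal-only mapping. It assumes the usual linear arithmetic for four-byte GB-18030 sequences (with <90 30 81 30> mapping to U+10000) and simply continues it past U+10FFFF, so that <E3 32 9A 36> becomes the internal value 0x110000. The function names and the choice of a list of integers as the "UCS-4" edit buffer are hypothetical, and any other malformed input is left to Python's strict `gb18030` codec, which raises an error.

```python
SUPP_BASE = 189000  # linear index of <90 30 81 30>, which maps to U+10000


def four_byte_linear(b: bytes) -> int:
    """Linear index of a four-byte GB-18030 sequence."""
    return (((b[0] - 0x81) * 10 + (b[1] - 0x30)) * 126 + (b[2] - 0x81)) * 10 + (b[3] - 0x30)


def decode_preserving(data: bytes) -> list[int]:
    """Decode to a list of code values; values above 0x10FFFF are internal-only."""
    out: list[int] = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            n = 1
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            n = 4
        else:
            n = 2
        chunk = data[i:i + n]
        if (n == 4 and len(chunk) == 4 and 0x81 <= chunk[0] <= 0xFE
                and 0x81 <= chunk[2] <= 0xFE and 0x30 <= chunk[3] <= 0x39):
            cp = 0x10000 + four_byte_linear(chunk) - SUPP_BASE
            if cp > 0x10FFFF:
                # Out of range for Unicode: keep it as an extra private value.
                out.append(cp)
                i += 4
                continue
        # Valid sequences follow the normal Unicode mapping (strict: may raise).
        out.extend(ord(ch) for ch in chunk.decode("gb18030"))
        i += n
    return out


def encode_preserving(cps: list[int]) -> bytes:
    """Save the buffer back to GB-18030, restoring the original out-of-range bytes."""
    out = bytearray()
    for cp in cps:
        if cp > 0x10FFFF:
            rest, b4 = divmod(cp - 0x10000 + SUPP_BASE, 10)
            rest, b3 = divmod(rest, 126)
            b1, b2 = divmod(rest, 10)
            if b1 > 0xFE - 0x81:
                raise ValueError(f"value {cp:#x} has no four-byte GB-18030 form")
            out += bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])
        else:
            out += chr(cp).encode("gb18030")
    return bytes(out)
```

A buffer produced by `decode_preserving` must never be handed to a UTF encoder as-is, since it may contain values that are not Unicode code points; only `encode_preserving` is meant to write it out.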
The same thing could be used, for example, to map the "Apple logo" registered character in files coded with MacRoman, instead of remapping it to a weakly interchangeable PUA code point: the out-of-band annotation of U+FFFD in the plain-text part of the edited file would keep track of the original encoding of this character, and the file could then be transmitted either in an altered form with a UTF, or by using some other text encapsulation format: for example an XML named entity (like "&apple-logo;"), a <char encoding="MacRoman" bytes="XX"/> element, or an <img> reference (in HTML files).
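As a rough illustration of the annotated U+FFFD approach, the Python sketch below keeps the plain text and a side table that records, for each placeholder, the source encoding and the original bytes, and then serializes the result using a <char .../> element of the kind suggested above. The class and function names are hypothetical, and the byte value 0xF0 for the Apple logo in Mac OS Roman is an assumption made for the example only.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class AnnotatedText:
    text: str                               # plain text, with U+FFFD placeholders
    origins: dict[int, tuple[str, bytes]]   # text index -> (encoding, original bytes)


def load_macroman(data: bytes, keep: frozenset = frozenset({0xF0})) -> AnnotatedText:
    """Decode MacRoman, replacing the bytes in `keep` with annotated U+FFFD."""
    chars = []
    origins = {}
    for i, b in enumerate(data):
        if b in keep:
            # Do not remap to a PUA code point; remember the source bytes instead.
            origins[i] = ("MacRoman", bytes([b]))
            chars.append("\uFFFD")
        else:
            chars.append(bytes([b]).decode("mac_roman"))
    return AnnotatedText("".join(chars), origins)


def to_xml_fragment(doc: AnnotatedText) -> str:
    """Serialize, turning each annotated placeholder into a <char .../> element."""
    parts = []
    for i, ch in enumerate(doc.text):
        if i in doc.origins:
            enc, raw = doc.origins[i]
            parts.append(f'<char encoding="{enc}" bytes="{raw.hex().upper()}"/>')
        else:
            parts.append(escape(ch))
    return "".join(parts)
```

Because MacRoman is a single-byte encoding, the byte offset and the text index coincide here; a multi-byte source encoding would need the side table to be keyed by the position in the decoded text instead.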