From: Andrew C. West
> (Unfortunately I've just noticed that BabelPad has a slight
> bug with out of range GB-18030 values such as
> <E3 32 9A 36> = U+110000.)

Could an editor that loads such an incorrect but legacy GB-18030 file agree
to load it and work on it through an internal-only UCS-4 mapping (or an
extended UTF-8 mapping), so as to preserve those out-of-range sequences as
if they were mapped into an extra PUA range?

Of course, saving the file in a UTF encoding would be forbidden, but saving
the internal UCS-4 buffer back to GB-18030 would preserve those out-of-range
GB-18030 sequences, without applying any other interpretation and without
changing them arbitrarily into the GB-18030 equivalent of U+FFFD.
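
As a rough illustration of that round trip, here is a minimal Python sketch.
The helper names are hypothetical, and it only covers the linear four-byte
range that starts at <90 30 81 30> = U+10000; the table-driven BMP part of
GB-18030 is left out. Internal values are plain integers, so they may exceed
0x10FFFF, which is the "extra PUA range" idea above.

```python
# Hypothetical helpers: linear GB-18030 four-byte range only.
SUPP_BASE_LINEAR = 189000  # linear index of <90 30 81 30>, i.e. U+10000


def four_bytes_to_linear(seq: bytes) -> int:
    """Turn a four-byte GB-18030 sequence into its linear index."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def to_internal(seq: bytes) -> int:
    """Map a four-byte sequence to an internal UCS-4-like integer.

    Values above 0x10FFFF are internal-only and are not Unicode code points.
    """
    linear = four_bytes_to_linear(seq)
    if linear < SUPP_BASE_LINEAR:
        raise ValueError("BMP four-byte sequences need the mapping table")
    return 0x10000 + (linear - SUPP_BASE_LINEAR)


def from_internal(value: int) -> bytes:
    """Inverse of to_internal: rebuild the original four bytes."""
    if value < 0x10000:
        raise ValueError("below the linear supplementary range")
    linear = value - 0x10000 + SUPP_BASE_LINEAR
    linear, d4 = divmod(linear, 10)
    linear, d3 = divmod(linear, 126)
    d1, d2 = divmod(linear, 10)
    if d1 > 0xFE - 0x81:
        raise ValueError("value too large for a four-byte sequence")
    return bytes([d1 + 0x81, d2 + 0x30, d3 + 0x81, d4 + 0x30])


def is_unicode(value: int) -> bool:
    """True only if the internal value may be written out in a UTF."""
    return value <= 0x10FFFF
```

With these definitions, to_internal(bytes.fromhex("E3329A36")) gives
0x110000, and from_internal(0x110000) gives the same four bytes back, so
nothing is lost when saving to GB-18030 again.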

The editor could still use the Unicode rules for all valid GB-18030
sequences, and the invalid characters could then be represented, for
example, with a colored or highlighted glyph such as <U+110000>. As neither
the input nor the output is a Unicode encoding scheme, I don't think this
invalidates Unicode conformance: the behavior would simply conform to
GB-18030 or to other legacy GB PUA mappings.

Of course, this editor will not be able to work on such text if its internal
encoding form is UTF-16, unless it uses additional internal markup or
storage for the GB sequences that were mapped in the edit buffer to a
0xFFFD UTF-16 code unit. This "augmented text", with annotated values for
each U+FFFD present in the text, would then not be handled as if it were
only Unicode plain text, but could constitute what Unicode calls a
higher-level protocol, used to keep the original code sequences that come
from a non-Unicode charset encoding and have no clear equivalent in Unicode.
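
A minimal sketch of such an annotated buffer, in Python. The class and method
names are hypothetical, and a Python str is indexed by code point rather than
by UTF-16 code unit, so this only approximates a real UTF-16 edit buffer. A
real editor would also have to shift the annotation offsets on every
insertion or deletion, which is not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AnnotatedText:
    """Plain text with U+FFFD placeholders plus out-of-band original bytes."""
    text: str = ""
    raw: Dict[int, bytes] = field(default_factory=dict)  # index -> source bytes

    def append_text(self, s: str) -> None:
        """Append characters that have a normal Unicode mapping."""
        self.text += s

    def append_unmapped(self, original: bytes) -> None:
        """Store U+FFFD in the text and remember the bytes it stands for."""
        self.raw[len(self.text)] = original
        self.text += "\uFFFD"

    def to_gb18030(self) -> bytes:
        """Save back to GB-18030, restoring the annotated sequences verbatim."""
        out = bytearray()
        for i, ch in enumerate(self.text):
            if i in self.raw:
                out += self.raw[i]              # original bytes, untouched
            else:
                out += ch.encode("gb18030")     # ordinary Unicode mapping
        return bytes(out)


buf = AnnotatedText()
buf.append_text("A")
buf.append_unmapped(bytes.fromhex("E3329A36"))  # the out-of-range sequence
buf.append_text("B")
```

An unannotated U+FFFD typed by the user would still be saved as the regular
GB-18030 encoding of U+FFFD; only the annotated placeholders get their
original bytes back.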

The same thing could be used, for example, to map the "Apple logo"
registered character in files coded with MacRoman, instead of remapping it
to a weakly interchangeable PUA code point: the out-of-band annotation of
U+FFFD in the plain-text part of the edited file would keep track of the
original encoding of this character, and the file could then be transmitted
either in an altered form with a UTF, or by using some other text
encapsulation format: for example an XML named entity (like
"&apple-logo;"), a <char encoding="MacRoman" bytes="XX"/> element, or an
<img> reference (in HTML files).
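
To make the <char/> encapsulation concrete, here is a small sketch using
Python's standard xml.etree.ElementTree module. The function names are
hypothetical, and writing the original bytes as uppercase hex in the "bytes"
attribute is only an assumption about how such an element could be filled in.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def char_element(encoding: str, original: bytes) -> str:
    """Serialize one annotated character as a <char .../> element."""
    elem = ET.Element("char", {
        "encoding": encoding,
        "bytes": original.hex().upper(),  # assumed convention: uppercase hex
    })
    return ET.tostring(elem, encoding="unicode")


def read_char_element(xml_text: str) -> Tuple[str, bytes]:
    """Recover the source charset name and the original bytes."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "char":
        raise ValueError("not a <char> element")
    return elem.get("encoding"), bytes.fromhex(elem.get("bytes"))
```

A receiver that knows the named charset can restore the exact bytes; one
that does not can still show a replacement glyph without guessing.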

