Dominikus Scherkl wrote:
> I would like to have a "source failure indicator symbol" (SFIS)
> character in Unicode, which a charset-conversion unit may
> insert into a text (Suggested position: U+FFF8).
> 
> [...]
> 
> Of course a converter can still use U+FFFD if it has no
> idea which character is intended or if Unicode doesn't contain
> the character.

I remember reading on this list about a proposal to allocate 256 code points
to represent the bytes of a non-Unicode character set which could not be
converted to Unicode.

What happened to that proposal? Was it ever formalized? If so, was it
rejected?

> The whole "character identities" discussion gave me another
> reason to introduce such an SFIS character:
> A font renderer may show the SFIS before a character which
> is replaced by another one [...]

Sorry for repeating myself, but my opinion is that a renderer is *never*
allowed to change one character to another. IMHO, all that discussion was
about the shape of glyphs, not about changing characters.

> I'd like to hear if my suggestion is completely weird or
> if anybody else thinks it might be useful.

One problem is the nature of the code point that follows the "SFIS".

Imagine that a stream, encoded in a certain character set, contains the byte
0xBF and that this byte is undefined in that character set. Mapping the
stream to Unicode, you convert 0xBF into a sequence of "SFIS" and U+00BF.
Clearly, that U+00BF would just be a placeholder for the unknown byte, not
an "INVERTED QUESTION MARK". 
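
To make the scenario concrete, here is a minimal Python sketch of such a
converter. It rests on assumptions that are not part of any standard: U+FFF8
is merely the position suggested in the quoted proposal (it is an unassigned
code point), and the handler name "sfis" and the function decode_with_sfis
are made up for illustration. ASCII is used as the source character set
because 0xBF is undefined in it.

```python
import codecs
import unicodedata

# Hypothetical: U+FFF8 is only the position suggested in the quoted proposal;
# it is an unassigned code point, not a real "SFIS" character.
SFIS = "\uFFF8"

def _sfis_handler(err):
    """Replace each undecodable byte 0xNN with SFIS followed by U+00NN."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    # Resume decoding right after the bytes that could not be decoded.
    return "".join(SFIS + chr(b) for b in bad_bytes), err.end

# "sfis" is an illustrative handler name, not a standard one.
codecs.register_error("sfis", _sfis_handler)

def decode_with_sfis(data, encoding):
    """Decode data, marking undefined bytes as SFIS + placeholder."""
    return data.decode(encoding, errors="sfis")

text = decode_with_sfis(b"A\xbfB", "ascii")
placeholder = text[2]  # the code point that follows the SFIS

print([f"U+{ord(c):04X}" for c in text])
print(unicodedata.name(placeholder))
```

Whatever the converter meant by it, the code point after the SFIS is still
U+00BF, and any process looking it up in the character database will find the
name and properties of the inverted question mark.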

The problem is that interpreting U+00BF as anything different from an
"INVERTED QUESTION MARK" violates Unicode Conformance Requirement C7: "A
process shall interpret a coded character representation according to the
character semantics established by this standard, if that process does
interpret that coded character representation."

Another, more practical, problem is that if the unrecognized byte is in the
ranges 0x00..0x1F or 0x7F..0x9F, this would generate the code point of a
Unicode control character, which could have undesired effects. E.g.,
U+0000 is often a string terminator; U+001B could trigger unexpected escape
sequences, etc.
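
As a rough illustration of the control-character issue, the following Python
sketch lists the byte values whose same-numbered code point has the general
category Cc, and shows what a consumer that stops at U+0000 would see. The
function name, the sample string, and the use of U+FFF8 as the SFIS are
assumptions made only for this example.

```python
import unicodedata

def control_placeholder_bytes():
    """Byte values 0xNN whose code point U+00NN is a control character (Cc)."""
    return [b for b in range(256) if unicodedata.category(chr(b)) == "Cc"]

risky = control_placeholder_bytes()

# Hypothetical converter output for the undefined byte 0x00: SFIS + U+0000.
# U+FFF8 stands in for the proposed SFIS; it is an unassigned code point.
converted = "abc" + "\uFFF8" + chr(0x00) + "def"

# A consumer that treats U+0000 as a string terminator sees only this much.
seen_by_consumer = converted.split("\x00", 1)[0]

print(len(risky), [f"0x{b:02X}" for b in risky[:4]])
print(repr(seen_by_consumer))
```

A converter following this scheme would therefore need some other placeholder
for those byte values, or its consumers would have to be taught to skip the
code point after an SFIS.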

_ Marco
