Dominikus Scherkl replied to Markus:

> > > My other suggestion (and the main reason to call the proposed
> > > character "source failure indicator symbol" (SFIS)) was intended
> > > especially for malformed UTF-8 input that has overlong encodings.
> > This is a special, custom form of error handling - why assign 
> > a character for it?
> Converting to and from UTF-8 is an everyday task, very important
> for all applications handling Unicode. So it is a special
> case, but a very common one.
> Therefore it would be nice to have a standardized, application-
> independent error handling for it. It is also a mechanism
> useful for many other charsets being converted to Unicode.

I've got to agree with Markus here. Among other things, encoding
a character which means "conversion failure occurred here" and
then embedding it in converted text is just a generic and
not very informative way of *representing* a conversion failure.
The actual error handling would still end up being up to the
application, every bit as much as what an application does
today with a U+FFFD in Unicode text is application-specific.
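To make the comparison concrete, here is a minimal sketch (my own example, not from the post; it relies on Python's standard `errors="replace"` handler) of how a conversion failure already gets *represented* with U+FFFD, while the actual handling remains the application's job:

```python
# Assumption: Python 3 stdlib behavior, shown only to illustrate the
# point above. Decoding with errors="replace" substitutes U+FFFD
# REPLACEMENT CHARACTER for each ill-formed byte; deciding what to do
# about the failure is still left entirely to the application.
data = b"abc\xc0\xafdef"   # contains the overlong two-byte form of "/"
text = data.decode("utf-8", errors="replace")
print(text)                # → 'abc\ufffd\ufffddef'
```

An SFIS would behave exactly the same way at this level: it marks where the failure happened, and nothing more.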

Adding this kind of character would then also complicate the
task of people trying to figure out how to write converters,
since they would then be scratching their heads to distinguish
between cases which warrant use of U+FFFD and those which
warrant this new SFIS instead. Maybe the distinction seems clear
to you, but I suspect that in practice people will become
confused about the distinctions, and there will be troubling
edge cases.

In the particular case of UTF-8, I would consider such a
mechanism nothing more than an attempted end run around the
tightened definition of UTF-8. It provides another path
whereby ill-formed UTF-8 could get converted and then end
up being interpreted by some process that doesn't know
the difference. In other words, it carries the risk of
reintroducing the security issue that we've been trying to
get legislated away, by finding a way to make it "o.k." to
interpret non-shortest UTF-8.
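The security point can be shown in a few lines (example bytes are mine, not from the post): 0xC0 0xAF is a non-shortest ("overlong") encoding of '/', and under the tightened definition a conformant decoder must reject it, precisely so that '/' cannot slip past byte-level security checks done before decoding.

```python
# Sketch, assuming Python's strict UTF-8 decoder as the "conformant
# process": the overlong encoding of '/' is rejected outright.
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")      # strict decoding
    print("accepted (non-conformant!)")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)      # → rejected: invalid start byte
```

A conversion path that instead emitted an SFIS and carried the recovered '/' forward would reopen exactly this hole for any downstream process that doesn't know the difference.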

> > You could just use an existing character or non-character for 
> > this, e.g., U+303E or U+FFFF or U+FDEF or similar.
> This is what I do in the meantime. But it's uncomfortable, because
> most editors display all noncharacters, unassigned characters,
> and characters not in the font the same way - which hides the
> INDICATION. The SFIS should be displayed to remind the reader that
> only THIS is an SFIS, unlike all the other empty squares in the
> text.

Your suggested encoding U+FFF8 wouldn't work this way, by the
way. U+FFF8 is reserved for format control characters -- and
those characters display *invisibly* by default -- not as
an empty square (or other fallback glyph) like miscellaneous
symbols which happen not to be in your fonts.

I think Markus's suggestion is correct. If you want to do
something like this internally to a process, use a noncharacter
code point for it. If you want to have visible display of this
kind of error handling for conversion, then simply declare a
convention for the use of an already existing character.
My suggestion would be: U+2620. ;-) Then get people to share
your convention.

I'm not intending to be facetious here, by the way. One problem
that character encoding runs into is that there are plenty
of people with good ideas for encoding meanings or functions,
and those ideas can end up turning into requests to encode
some invented character just for that meaning or function.
For example, I might decide that it was a good idea to have
a symbol by which I could mark a following date string as
indicating a death date--that would be handy for bibliographies
and other reference works. Now I could come to the Unicode
Consortium and ask for encoding of U+XXXX DEATH DATE SYMBOL,
or I could instead discover that U+2020 DAGGER is already used
in that meaning for some conventions. There are *plenty* of
symbol characters available in Unicode -- way more than in
any other character encoding standard. And it is a much
lighter-weight process to establish a convention for use
of an existing symbol character than it is to encode a new character
specifically for that meaning/function and then force everyone
to implement it as a new character.

> Additionally, I think we should have a standardized way to display
> old UTF-8 text without losing information (overlong UTF-8 was
> allowed for years) 

Not really. And in any case, there is nothing to be gained
here by "displaying old utf-8 text without losing information".
The way to deal with that is to *filter* it into legal
UTF-8 text, by means of an explicit process designed to
recover what would otherwise be rejected as illegal data.
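Such a filter is an ordinary, explicit conversion step, not something the character encoding needs to know about. A minimal sketch (the helper name and its scope are my own; it only recovers two-byte overlong forms, which always start with the lead bytes 0xC0 or 0xC1, and handles everything else with standard strict-ish decoding):

```python
# Hypothetical recovery filter, assumption: legacy data may contain
# two-byte overlong sequences (lead byte 0xC0 or 0xC1), which are
# decoded to the code point they "meant"; all other byte runs go
# through the normal UTF-8 decoder with U+FFFD replacement.
def recover_old_utf8(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # 0xC0/0xC1 lead bytes only ever occur in overlong 2-byte forms
        if b in (0xC0, 0xC1) and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            # decode the run up to the next suspect lead byte normally
            j = i
            while j < len(data) and data[j] not in (0xC0, 0xC1):
                j += 1
            out.append(data[i:j].decode("utf-8", errors="replace"))
            i = j
    return "".join(out)

print(recover_old_utf8(b"a\xc0\xafb"))   # → a/b
```

The key property is that the recovery happens in one clearly labeled place, and the output is legal UTF-8; no downstream consumer ever sees the overlong forms.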

> - glyphing it is not a good solution, and simply
> decoding the overlong forms is not allowed. This is a self-made
> problem, so Unicode should provide an inherent way to solve it.

There are plenty of ways to solve these things -- by API design
or by specialized conversions designed to deal with otherwise
unrepresentable data. But trying to bake conversion error
representation into the character encoding itself is, in
my opinion, an error in itself.

--Ken
