Dominikus Scherkl replied to Markus:

> > > My other suggestion (and the main reason to call the proposed
> > > character "source failure indicator symbol" (SFIS)) was intended
> > > especially for malformed UTF-8 input that has overlong encodings.
> >
> > This is a special, custom form of error handling - why assign
> > a character for it?
>
> Converting from and to UTF-8 is an everyday task, very important
> for all applications handling Unicode. So it is a special case,
> but a very common one. Therefore it would be nice to have a
> standardized, application-independent error handling for it. It is
> also a mechanism useful for many other charsets being converted to
> Unicode.
I've got to agree with Markus here.

Among other things, encoding a character which means "conversion
failure occurred here" and then embedding it in converted text is
just a generic and not very informative way of *representing* a
conversion failure. The actual error handling would still end up
being up to the application, every bit as much as what an
application does today with a U+FFFD in Unicode text is
application-specific.

Adding this kind of character would then also complicate the task of
people trying to figure out how to write converters, since they
would then be scratching their heads trying to distinguish between
cases which warrant use of U+FFFD and those which warrant this new
SFIS instead. Maybe the distinction seems clear to you, but I
suspect that in practice people would become confused about it, and
there would be troubling edge cases.

In the particular case of UTF-8, I would consider such a mechanism
nothing more than an attempted end run around the tightened
definition of UTF-8. It provides another path whereby ill-formed
UTF-8 could get converted and then end up being interpreted by some
process that doesn't know the difference. In other words, it carries
the risk of reintroducing the security issue that we've been trying
to get legislated away, by finding a way to make it "o.k." to
interpret non-shortest-form UTF-8. (There is a short illustration of
this below.)

> > You could just use an existing character or noncharacter for
> > this, e.g., U+303E or U+FFFF or U+FDEF or similar.
>
> This is what I do meanwhile. But it's uncomfortable, because most
> editors display all noncharacters, unassigned characters, or
> characters not in the font the same way - which hides the
> INDICATION. The SFIS should be displayed so as to remind the
> reader that only THIS is an SFIS, unlike all the other empty
> squares in the text.

Your suggested code point U+FFF8 wouldn't work this way, by the way.
U+FFF8 is reserved for format control characters -- and those
characters display *invisibly* by default, not as an empty square
(or other fallback glyph) like miscellaneous symbols which happen
not to be in your fonts.

I think Markus' suggestion is correct. If you want to do something
like this internally to a process, use a noncharacter code point for
it. If you want to have visible display of this kind of error
handling for conversion, then simply declare a convention for the
use of an already existing character. My suggestion would be:
U+2620. ;-) Then get people to share your convention.

I'm not intending to be facetious here, by the way. One problem that
character encoding runs into is that there are plenty of people with
good ideas for encoding meanings or functions, and those ideas can
end up turning into requests to encode some invented character just
for that meaning or function. For example, I might decide that it
was a good idea to have a symbol by which I could mark a following
date string as indicating a death date -- that would be handy for
bibliographies and other reference works. Now I could come to the
Unicode Consortium and ask for the encoding of U+XXXX DEATH DATE
SYMBOL, or I could instead discover that U+2020 DAGGER is already
used with that meaning in some conventions.

There are *plenty* of symbol characters available in Unicode -- way
more than in any other character encoding standard. And it is a much
lighter-weight process to establish a convention for the use of an
existing symbol character than it is to encode a new character
specifically for that meaning/function and then force everyone to
implement it as a new character.
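Coming back to the technical points about U+FFFD and
non-shortest-form UTF-8 above: here is a tiny illustration, using
Python purely as a convenient notation (nothing about it is specific
to any standard). A conformant UTF-8 decoder must reject the
overlong sequence outright, and the common recovery convention
substitutes U+FFFD -- after which what to *do* about the failure is
still entirely the application's problem:

    # 0xC0 0xAF is the classic overlong (non-shortest-form) encoding
    # of U+002F SOLIDUS, the kind of sequence behind old
    # directory-traversal exploits.
    data = b"abc\xc0\xafdef"

    # A strict, conformant UTF-8 decoder must reject it outright:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)

    # The stock recovery substitutes U+FFFD REPLACEMENT CHARACTER
    # (here, one per offending byte); interpreting the result is
    # still up to the application:
    print(data.decode("utf-8", errors="replace"))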
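And the "declare a convention" approach needs no new character
either: it is a few lines at the application level. A sketch,
hypothetical from top to bottom -- the handler name and the choice
of U+2620 are the convention itself, not a property of the
character:

    import codecs

    # A private, application-level convention: each ill-formed
    # sequence in the input is marked with U+2620 SKULL AND
    # CROSSBONES. Internally one could just as well use a
    # noncharacter such as U+FDEF; the *convention* carries the
    # meaning, not the code point.
    def mark_source_failure(exc):
        if isinstance(exc, UnicodeDecodeError):
            return ("\u2620", exc.end)  # substitute, then resume
        raise exc

    codecs.register_error("source-failure", mark_source_failure)

    print(b"abc\xc0\xafdef".decode("utf-8", errors="source-failure"))
    # Each offending byte is displayed as U+2620.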
> Additionally, I think we should have a standardized way to display
> old UTF-8 text without losing information (overlong UTF-8 was
> allowed for years)

Not really. And in any case, there is nothing to be gained here by
"displaying old UTF-8 text without losing information". The way to
deal with that is to *filter* it into legal UTF-8 text, by means of
an explicit process designed to recover what would otherwise be
rejected as illegal data.

> - glyphing is not a fine way, and simply decoding the overlong
> forms is not allowed. This is a self-made problem, so Unicode
> should provide an inherent way to solve it.

There are plenty of ways to solve these things -- by API design or
by specialized conversions designed to deal with otherwise
unrepresentable data. But trying to bake conversion error
representation into the character encoding itself is, in my opinion,
an error in itself.
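For instance, such a recovery filter might look like the following
sketch (Python again, and hypothetical throughout): a deliberate,
one-time repair pass over legacy data that rewrites two-byte
overlong sequences into shortest form *before* the result is ever
handed to a real UTF-8 decoder. It is emphatically not a UTF-8
decoder itself; it is a specialized conversion that an application
applies knowingly:

    def repair_overlong2(data: bytes) -> bytes:
        """Rewrite two-byte overlong UTF-8 sequences to shortest form.

        A data-recovery pass for legacy input; apply it before strict
        UTF-8 decoding. All other bytes pass through untouched.
        """
        out = bytearray()
        i = 0
        while i < len(data):
            b = data[i]
            # 0xC0/0xC1 lead bytes occur only in two-byte overlong
            # forms; recover the intended scalar value and re-encode
            # it in shortest form.
            if (b in (0xC0, 0xC1) and i + 1 < len(data)
                    and 0x80 <= data[i + 1] <= 0xBF):
                scalar = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
                out.extend(chr(scalar).encode("utf-8"))
                i += 2
            else:
                out.append(b)
                i += 1
        return bytes(out)

    # The overlong 0xC0 0xAF becomes a plain 0x2F ('/') before any
    # strict decoding takes place:
    print(repair_overlong2(b"abc\xc0\xafdef").decode("utf-8"))
    # -> abc/def

--Ken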