> Why would anyone, faced with a UTF-8 file that contains invalid
> sequences, want to retain the invalid sequences, much less convert the
> file to another encoding form that either (a) preserves the invalid
> sequences or (b) leaves a marker showing where they were?  Invalid
> sequences are garbage.  They don't represent anything, and you can't
> always even tell what they were supposed to represent.
> 
I need to store UNIX filenames in a UTF-16 database residing on Windows. If
the filenames are all in ANSI, converting ANSI->Unicode is no problem. But
what if I have a filesystem with filenames mainly in UTF-8? Nobody can
guarantee that all of them will be in UTF-8. Some may still be in ANSI
(well, ISO). In fact, at some point in time there will be UNIX servers with
50% of filenames in UTF-8 and 50% in ANSI (or something else, for that
matter).

Hence my example of "ls > ls.out". My requirement is that there can be no
data loss. And so far so good. If I could keep the data in a UTF-8 database,
everything would be fine. But I cannot convert this ls.out file to UTF-16
because I cannot guarantee that I will always be able to get the original
data back.

I am not the first to face this problem. Markus Kuhn actually suggested a
solution back in 2000. He also told me that Emacs deals with illegal
UTF-8 sequences by using the 0x200080 to 0x2000ff range for them. Well, it
works for UTF-32 (or is it UCS-4?), but NOT for UTF-16, which is why I
insist that UTF-16 is kept inferior by the unicoders who keep saying that
illegal sequences are not Unicode's problem.
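
To make the UTF-16 limitation concrete, here is a small sketch of the
arithmetic. The function name is mine, purely for illustration; the constants
are the standard surrogate ranges. It shows that the largest value a surrogate
pair can reach is 0x10FFFF, so a range starting at 0x200080 cannot be
expressed in UTF-16.

```python
# Surrogate ranges defined by UTF-16.
HIGH_MIN, HIGH_MAX = 0xD800, 0xDBFF
LOW_MIN, LOW_MAX = 0xDC00, 0xDFFF


def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they encode."""
    return 0x10000 + ((high - HIGH_MIN) << 10) + (low - LOW_MIN)


# The largest pair gives the ceiling of what UTF-16 can represent.
utf16_max = surrogate_pair_to_code_point(HIGH_MAX, LOW_MAX)

# The range mentioned above starts well beyond that ceiling.
fits_in_utf16 = 0x200080 <= utf16_max
```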

There is a solution. Probably more than one. I think we should be discussing
which is the best one rather than discussing whose problem the illegal
sequences are. In my case illegal sequences are not something that faulty
software introduced. They are real data. Illegal data? Well, maybe it will
sound better if I say irregular sequences and irregular data. And actually,
yes, this is irregular data - a piece of ANSI encoded data hidden among lots
of UTF-8 data.


OK, based on David Hopwood's answer, I assume that UTF-8B is what I need to
use. What I wanted to verify is:
A - UTF-8B is only one of the possible solutions. I wanted to know that
UTF-8B (using the unpaired surrogates and not something else) is
sufficiently recognised and that no other solution is gaining popularity.
B - Once it is clear that UTF-8B is a good solution (and that there is a
substantial need for it), it should find some way into Unicode.
Generally acknowledging that such conversions exist and that they are
allowed under certain conditions is a good thing. There may be reasons for
UTF-8B not to be a part of the Unicode standard itself, but in that case I
think it should be defined and mentioned in a UTR. Why risk various
implementations if there already is one that is IMHO very good?
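
For illustration, here is a minimal sketch of the round trip I am after. It
assumes Python's "surrogateescape" error handler, which maps each undecodable
byte 0x80..0xFF to an unpaired low surrogate U+DC80..U+DCFF, and I am assuming
that matches the UTF-8B mapping. The function names and the sample data are
hypothetical. The "surrogatepass" handler is needed so that the UTF-16 codec
accepts the unpaired surrogates.

```python
def to_utf16_storage(raw: bytes) -> bytes:
    """Convert possibly irregular UTF-8 bytes to UTF-16LE without data loss."""
    # Invalid bytes become unpaired low surrogates instead of raising.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Let the unpaired surrogates through into the UTF-16 byte stream.
    return text.encode("utf-16-le", errors="surrogatepass")


def from_utf16_storage(stored: bytes) -> bytes:
    """Recover the original bytes from the UTF-16LE form."""
    text = stored.decode("utf-16-le", errors="surrogatepass")
    # Escaped surrogates turn back into the original single bytes.
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical "ls > ls.out" content: one UTF-8 name, one ISO-8859-1 name.
ls_out = b"caf\xc3\xa9.txt\n" + b"caf\xe9.txt\n"

stored = to_utf16_storage(ls_out)
restored = from_utf16_storage(stored)
round_trip_ok = restored == ls_out
```

The valid UTF-8 name is converted normally, while the stray 0xE9 byte is
carried through the UTF-16 form as U+DCE9 and restored on the way back.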


Lars Kristan
Storage & Data Management Lab
HERMES SoftLab
