> Why would anyone, faced with a UTF-8 file that contains invalid
> sequences, want to retain the invalid sequences, much less convert the
> file to another encoding form that either (a) preserves the invalid
> sequences or (b) leaves a marker showing where they were? Invalid
> sequences are garbage. They don't represent anything, and you can't
> always even tell what they were supposed to represent.
>

I need to store UNIX filenames in a UTF-16 database residing on Windows. If I use an ANSI->Unicode conversion, there is no problem. But what if I have a filesystem with filenames mainly in UTF-8? Nobody can guarantee that all of them will be in UTF-8. Some may still be in ANSI (well, ISO). In fact, at some point in time there will be UNIX servers with 50% of filenames in UTF-8 and 50% in ANSI (or something else, for that matter).

Hence my example of "ls > ls.out". My requirement is that there can be no data loss. And so far so good. If I could keep the data in a UTF-8 database, everything would be fine. But I cannot convert this ls.out file to UTF-16, because I cannot guarantee that I will always be able to get the original data back.

I am not the first to face this problem. Markus Kuhn suggested a solution back in 2000. And he told me that Emacs deals with illegal UTF-8 sequences by using the 0x200080 to 0x2000ff range for them. Well, it works for UTF-32 (or is it UCS-4?), but NOT for UTF-16, because those values lie above U+10FFFF, the highest code point UTF-16 can encode. That is why I insist that UTF-16 is kept inferior by the unicoders who keep saying that illegal sequences are not Unicode's problem.

There is a solution. Probably more than one. I think we should be discussing which is the best one rather than discussing whose problem the illegal sequences are. In my case, illegal sequences are not something faulty software introduced. They are real data. Illegal data? Well, maybe it will sound better if I say irregular sequences and irregular data. And actually, yes, this is irregular data - a piece of ANSI-encoded data hidden among lots of UTF-8 data.

OK, based on David Hopwood's answer, I assume that UTF-8B is what I need to use. What I wanted to verify is:

A - UTF-8B is only one of the possible solutions. I wanted to know that UTF-8B (using the unpaired surrogates and not something else) is sufficiently recognised and that no other solution is gaining popularity.

B - Once it is clear that UTF-8B is a good solution (and that there is a substantial need for it), it should find some way into Unicode. Generally acknowledging that such conversions exist and that they are allowed under certain conditions is a good thing. There may be reasons for UTF-8B not to be a part of the Unicode standard itself, but then I think it should be defined and mentioned in a UTR. Why risk various implementations if there already is one that is IMHO very good?

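To make the round-trip requirement concrete, here is a minimal sketch in Python. It relies on Python's `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to an unpaired low surrogate U+DC80..U+DCFF; as far as I can tell this is the same idea as UTF-8B, but treat that equivalence as an assumption. The function names and the sample filename are hypothetical, chosen only for illustration.

```python
# Sketch only: store a filename of mixed UTF-8 / ISO 8859-1 bytes as UTF-16
# and recover the original bytes exactly. Helper names are hypothetical.

def bytes_to_utf16(name: bytes) -> bytes:
    # Invalid UTF-8 bytes become unpaired low surrogates (U+DC80..U+DCFF).
    text = name.decode("utf-8", errors="surrogateescape")
    # "surrogatepass" lets the unpaired surrogates through into UTF-16.
    return text.encode("utf-16-le", errors="surrogatepass")

def utf16_to_bytes(stored: bytes) -> bytes:
    text = stored.decode("utf-16-le", errors="surrogatepass")
    # The escaped surrogates are turned back into the original raw bytes.
    return text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 "café_" followed by "café" in ISO 8859-1 (lone byte 0xE9).
raw = b"caf\xc3\xa9_" + b"caf\xe9"

stored = bytes_to_utf16(raw)
restored = utf16_to_bytes(stored)
assert restored == raw  # no data loss across the UTF-16 round trip
```

Note that the stored UTF-16 is deliberately not well-formed: the unpaired surrogate is the marker for the irregular byte, so any consumer of the database has to tolerate it.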
Lars Kristan
Storage & Data Management Lab
HERMES SoftLab