Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Arcane Jill responded:
> >> Windows filesystems do know what encoding they use.
> >
> > Err, not really. MS-DOS *needs to know* the encoding to use, a bit
> > like a *nix application that displays filenames needs to know the
> > encoding to use the correct set of glyphs (but the constraints are
> > much heavier.)
>
> Sure, but MS-DOS is not Windows. MS-DOS uses "8.3" filenames. But it's
> not like MS-DOS is still terrifically popular these days.
I don't know what Antoine meant by MS-DOS, but since he mentioned it in the Windows context, I took it to mean Windows console applications (the console is still often referred to as a DOS box, I think).

> The fact that applications can still open files using the legacy fopen()
> call (which requires char*, hence 8-bit-wide, strings) is kind of
> irrelevant. If the user creates a file using fopen() via a code page
> translation, AND GETS IT WRONG, then the file will be created with
> Unicode characters other than those she intended - but those characters
> will still be Unicode and unambiguous, no?
Funny thing. Nobody cares much if a Latin 2 string is misinterpreted and a Latin 1 conversion is used instead; as long as the file gets created, nobody notices. But if a Latin 2 string is misinterpreted and a UTF-8 conversion is used? You don't just get a filename with characters other than those you expected. Either the file won't open at all (depending on where and how the validation is done), or you risk that two files you create one after another will overwrite each other. Note that I am talking about files created from within this scenario, not files that already existed on the disk.
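
To make the overwriting risk concrete, here is a minimal sketch, assuming the Win32 MultiByteToWideChar() call and two hypothetical Latin 2 filenames of my own invention (my illustration, not anything from the quoted mail):

    #include <windows.h>
    #include <wchar.h>
    #include <stdio.h>

    int main(void)
    {
        /* Two distinct Latin 2 names: "soubor<E9>.txt" and "soubor<ED>.txt".
           Both 0xE9 and 0xED are UTF-8 lead bytes with no continuation
           byte after them, so as UTF-8 both names are invalid in the
           same position. */
        const char name1[] = "soubor\xE9.txt";
        const char name2[] = "soubor\xED.txt";
        wchar_t w1[64], w2[64];

        /* Without MB_ERR_INVALID_CHARS the invalid byte is replaced with
           U+FFFD (or silently dropped on older systems), so the two
           different names convert to the same wide string. */
        MultiByteToWideChar(CP_UTF8, 0, name1, -1, w1, 64);
        MultiByteToWideChar(CP_UTF8, 0, name2, -1, w2, 64);
        printf("names collide: %s\n", wcscmp(w1, w2) == 0 ? "yes" : "no");

        /* With MB_ERR_INVALID_CHARS the conversion fails outright
           instead - the "file won't open at all" case. */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                name1, -1, w1, 64) == 0)
            printf("conversion rejected, error %lu\n", GetLastError());
        return 0;
    }

Depending on where the validation happens you get one of the two failure modes: the distinct names collapse into one, or the call is rejected. Either way it is much worse than the harmless Latin 2 / Latin 1 mixup.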

Second thing: OK, you say fopen is a legacy call. True, you can use _wfopen. So, you can write a console application in Unicode and all problems are solved? No. Standard input and standard output are 8-bit, and a code page is used, and it has to remain so if you want the old and the new applications to be able to communicate. So the logical conclusion is that UTF-8 needs to be used instead of a legacy code page. Unfortunately, Windows has problems with that: try MODE CON: CP SELECT=65001. Much of it works, but batch files don't run.
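
For what it's worth, here is a minimal sketch, assuming MSVC on Windows, of what the split looks like: the filename side can go fully Unicode through _wfopen(), while stdout remains a byte stream interpreted in the console code page (the filename is a hypothetical example of mine):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* A filename with a character outside most single-byte code
           pages; _wfopen() hands it to the filesystem without any
           code page round trip. */
        FILE *f = _wfopen(L"p\u0159\u00edklad.txt", L"w");
        if (f == NULL) {
            perror("_wfopen");
            return 1;
        }
        fputws(L"written through the wide-character API\n", f);
        fclose(f);

        /* stdout, by contrast, is 8-bit: whatever is printed here gets
           interpreted by the console according to its code page
           (chcp / MODE CON CP SELECT), not by this program. */
        puts("done");
        return 0;
    }

(The console code page can also be switched from inside a program with SetConsoleCP(CP_UTF8) and SetConsoleOutputCP(CP_UTF8), which is just the programmatic counterpart of the MODE command and runs into the same limitations.)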

Now suppose Windows did work correctly with the code page set to UTF-8. You write an application that reads stdin, counts the words longer than 10 code points, and passes the input unmodified to stdout (a sketch of such a filter follows the list below). What happens:

* set CP to Latin 1, process Latin 1: correct result
* set CP to Latin 1, process UTF-8:   wrong result
* set CP to UTF-8, process UTF-8:     correct result
* set CP to UTF-8, process Latin 1:   wrong result, corrupted output
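
Here is a minimal sketch of such a filter, assuming the program takes the UTF-8 view of the data (it counts code points by skipping UTF-8 continuation bytes); this is my illustration of the scenario, not code from anywhere else:

    #include <stdio.h>
    #include <ctype.h>

    int main(void)
    {
        int c, in_word = 0;
        long cp_in_word = 0, long_words = 0;

        while ((c = getchar()) != EOF) {
            putchar(c);          /* pass every byte through unmodified */

            if (isspace(c)) {
                if (in_word && cp_in_word > 10)
                    long_words++;
                in_word = 0;
                cp_in_word = 0;
            } else {
                in_word = 1;
                /* Count code points, not bytes: UTF-8 continuation bytes
                   have the form 10xxxxxx and are skipped. If the data is
                   really Latin 1, every byte is a whole character and
                   the count comes out wrong - the mismatched rows of the
                   table above. */
                if ((c & 0xC0) != 0x80)
                    cp_in_word++;
            }
        }
        if (in_word && cp_in_word > 10)
            long_words++;

        fprintf(stderr, "words longer than 10 code points: %ld\n", long_words);
        return 0;
    }

The program itself never touches the console code page; whether the counts and the echoed output come out right depends entirely on whether the code page of the data matches the encoding the program assumes.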

Now, I wonder why Windows does not support UTF-8 as well as one would want...


Lars
