On Fri, May 30, 2014 at 09:18:19PM +0200, Harald Becker wrote:
> Hi Rich !
>
> >My statement was imprecise; of course to support users still
> >stuck on legacy locales, nl_langinfo(CODESET) should be
> >consulted.
>
> How do you determine the correct code set of a foreign file
> system on an external drive? How can you tell if all systems
> which accessed this drive has handled translations in the correct
> way?

All modern filesystems used on external devices (fat32, ntfs,
udf, ...) use Unicode-based encodings for filenames, so the
foreign encoding is known and fixed.

> >> .... and not only unzip may produce such results. Think of
> >> using an USB stick at an Windows machine, then carry that over
> >> to an Linux machine.
> >
> >The filenames are stored in UCS-2. No problem.
>
> UCS-2 with different code page translations from an 8 bit
> charset. Translations which leave name mapping in inconsistent
> state when further translations occur.

I don't follow what you think the problem is.

> >If you mount it incorrectly, then this is user error.
>
> Correct, all those trouble arrives due to anybody having an
> incorrect setup. This will ripple trough and may produce trouble
> on other ends.

All modern Linux-based systems use the utf8 option by default when
mounting filesystems that don't store filenames as pure byte
strings but in a Unicode-based form. You have to be rolling your
own or else actively breaking your system's default setup to get
this wrong.

> >All programs are not affected. Only programs which read
> >filenames as byte strings from foreign sources (such as the
> >directory table of a zip file) are affected.
>
> .... but how do you know the code page the zip archive uses. How
> do you know you need to do translations? I'm unsure if the archiv
> contains this information, so it needs to be provided by a much
> more error prone user.

When encountering such an archive, the unzip utility could simply
exit with an error when there are non-ASCII names unless the user
specifies the encoding. To be less error-prone, it could print the
names as interpreted in several different encodings as part of the
error message, to help the user identify which one is correct.
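
To make that concrete, here is a rough sketch in Python of the
behaviour I have in mind. It is not how any existing unzip works:
the function names and the candidate list are invented for
illustration, and a real tool would pick its candidates with more
care.

```python
# Hypothetical sketch: refuse non-ASCII names unless an encoding is
# given, and show how each name reads under a few candidate encodings.
# The candidate list below is an arbitrary illustration.
CANDIDATES = ("cp437", "cp1252", "iso8859-15", "koi8-r", "shift_jis")


def describe_name(raw, candidates=CANDIDATES):
    """Return (encoding, text) pairs for every candidate that decodes raw."""
    readings = []
    for enc in candidates:
        try:
            readings.append((enc, raw.decode(enc)))
        except UnicodeDecodeError:
            # This candidate cannot represent the byte sequence; skip it.
            pass
    return readings


def check_names(names, encoding=None):
    """Decode raw filenames (bytes) from an archive directory table.

    With an explicit encoding, decode everything with it. Without one,
    accept pure-ASCII names and raise ValueError otherwise, listing the
    possible readings of each offending name in the error message.
    """
    if encoding is not None:
        return [n.decode(encoding) for n in names]
    bad = [n for n in names if not n.isascii()]
    if not bad:
        return [n.decode("ascii") for n in names]
    lines = ["non-ASCII names found; specify the encoding. Possible readings:"]
    for n in bad:
        for enc, text in describe_name(n):
            lines.append(f"  {n!r} as {enc}: {text}")
    raise ValueError("\n".join(lines))
```

With this, check_names([b"caf\x82.txt"]) fails and lists the
candidate readings, while check_names([b"caf\x82.txt"],
encoding="cp437") succeeds once the user has picked one.
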
IMO it should also automatically assume UTF-8 and suppress the
error condition if the names all decode as valid UTF-8, since the
probability of meaningful non-UTF-8 text decoding successfully as
UTF-8 is negligible.

Rich
_______________________________________________
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox