Eli Zaretskii <e...@gnu.org>:

>> From: Marko Rauhamaa <ma...@pacujo.net>
>>
>> By setting the character set artificially to Latin-1 in Guile, all
>> pathnames are accessible to it.
>
> No, they aren't, not as file names. E.g., you cannot meaningfully
> downcase or upcase such "characters", you cannot count characters (as
> opposed to bytes), you cannot calculate how much screen estate will be
> needed to display them, with some Far Eastern encodings you cannot
> correctly search them for some specific ASCII characters (because they
> can be part of a multibyte sequence), etc. etc. IOW, you cannot work
> with file names as human-readable text, which is something many
> programs need to do.

You can, in a roundabout way. You do the low-level file I/O in Latin-1. Then, you reencode into UTF-8, and if you get an exception, you deal with the situation. Otherwise, you may not even be able to remove a file with a non-UTF-8 name.

> File names _are_ strings, there's no way around that.

Linux pathnames are classic C strings.

> They are strings because _people_ name files and give them meaningful
> names and extensions.

The Linux kernel just doesn't care, and shouldn't. It's acceptable for Guile to create a higher-level illusion, but it shouldn't sacrifice completeness while doing so. You should be able to manipulate every conceivable filename from Guile code.

(Python 3.x accepts bytevectors as well as strings everywhere. For example, listing a directory returns strings if the directory name is given as a string. It returns bytevectors if the directory name is given as a bytevector. Python's bytevector literals accept ASCII, which makes this rather convenient.)

> If Guile cannot easily work with file names encoded in a codeset other
> than the current locale's one, then Guile should be extended to allow
> a program to tell it in which encoding to interpret a particular name.

A program usually has no clue how a pathname has been encoded.

> (I think Guile already supports that, but maybe I misremember.) But
> lobbying for treating file names as byte streams, let alone Latin-1
> characters, is a large step backwards, to 1990s when we didn't know
> better. We've come a long way since then and learned a lot on the way.

At least our backwardness allowed Linux to jump directly to UTF-8 and not be afflicted by UCS-2 like Windows and Java. I'm not saying bytevectors are elegant, but we should not replace them with wishful thinking.

Ideally, we should have a bijective bytevector-to-string mapping. (Python 3.x uses Unicode surrogate code points for that purpose but doesn't quite achieve bijection, unfortunately.)
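
To make the Latin-1 round trip and the surrogate remark concrete, here is a minimal Python 3 sketch. The file name b"caf\xe9", the helper as_text and the file plain.txt are made up for illustration. The collision shown in the middle is only one way the surrogate mapping can fall short of a bijection; treat it as an example, not as a complete account.

```python
import os
import tempfile

# Latin-1 maps every byte 0x00..0xFF to the code point with the same number,
# so decoding and re-encoding loses nothing, whatever the bytes "mean".
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Hypothetical file name: "cafe" with an acute accent in Latin-1.
# The byte 0xE9 on its own is not valid UTF-8.
raw_name = b"caf\xe9"


def as_text(name: bytes):
    """Return the name as text if it is valid UTF-8, else None."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return None  # the caller keeps working with the raw bytes


utf8_text = as_text(b"caf\xc3\xa9")  # valid UTF-8, decodes to a str
not_text = as_text(raw_name)         # invalid UTF-8, stays bytes-only

# Python 3's "surrogateescape" error handler turns each undecodable byte
# into a lone surrogate in U+DC80..U+DCFF, which encodes back to that byte.
escaped = raw_name.decode("utf-8", "surrogateescape")
assert escaped == "caf\udce9"
assert escaped.encode("utf-8", "surrogateescape") == raw_name

# The mapping is not one-to-one in the other direction: two different
# strings can encode to the same byte sequence.
valid = "\u00e9"             # e with acute accent, as a normal character
smuggled = "\udcc3\udca9"    # the same two UTF-8 bytes as escaped surrogates
assert valid != smuggled
assert (valid.encode("utf-8", "surrogateescape")
        == smuggled.encode("utf-8", "surrogateescape"))

# os.listdir returns the type it was given: str in, str out; bytes in,
# bytes out. An ASCII name is used so this runs on any file system.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "plain.txt"), "w").close()
    str_names = os.listdir(d)
    bytes_names = os.listdir(os.fsencode(d))
```

With the bytes variants, a name such as raw_name could still be passed to os.remove or os.rename even though it never becomes valid text.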

I'm a bit sorry that Guile repeated Python 3's mistake and brought (Unicode) strings to the center. Strings are a highly special-purpose data structure; I have never really had a need for them in my decades of programming. Also, I suspect strings are much too simplistic for any serious typesetting or GUI work. It seems the sweet spot of strings is text/plain mail messages and Usenet postings.

Guile 1.x's and Python 2.x's bytevector/string confusion was actually a very happy medium. Neither the OS nor the programming language placed any interpretation on the byte sequences. That was left to the application.


Marko