Re: Running script from directory with UTF-8 characters

Marko Rauhamaa Tue, 22 Dec 2015 13:39:45 -0800

Eli Zaretskii <e...@gnu.org>:

>> From: Marko Rauhamaa <ma...@pacujo.net>
>> By setting the character set artificially to Latin-1 in Guile, all
>> pathnames are accessible to it.
>
> No, they aren't, not as file names. E.g., you cannot meaningfully
> downcase or upcase such "characters", you cannot count characters (as
> opposed to bytes), you cannot calculate how much screen estate will be
> needed to display them, with some Far Eastern encodings you cannot
> correctly search them for some specific ASCII characters (because they
> can be part of a multibyte sequence), etc. etc. IOW, you cannot work
> with file names as human-readable text, which is something many
> programs need to do.


You can, in a roundabout way. You do the low-level file I/O in Latin-1.
Then you reinterpret the same bytes as UTF-8, and if that raises an
exception, you deal with the situation.

Otherwise, you may not even be able to remove a file with a non-UTF-8
name.
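To make the roundabout way concrete, here is a minimal sketch in Python
rather than Guile. The helper name to_text and its (text, is_utf8)
return convention are made up for illustration; how to "deal with the
situation" is left to the caller.

```python
def to_text(latin1_name):
    """Reinterpret a Latin-1-decoded file name as UTF-8 where possible.

    Hypothetical helper: returns (text, is_utf8).
    """
    # Latin-1 maps every byte 0..255 to exactly one code point, so this
    # step recovers the original bytes without loss.
    raw = latin1_name.encode("latin-1")
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Not valid UTF-8: keep the lossless Latin-1 form so the file
        # can still be opened, renamed or removed.
        return latin1_name, False


# The same visible name, once encoded as UTF-8 and once as Latin-1.
utf8_name = b"caf\xc3\xa9".decode("latin-1")
legacy_name = b"caf\xe9".decode("latin-1")
print(to_text(utf8_name))
print(to_text(legacy_name))
```

Either way the original bytes remain recoverable with
encode("latin-1"), which is what keeps every file reachable.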

> File names _are_ strings, there's no way around that.

Linux pathnames are classic C strings.

> They are strings because _people_ name files and give them meaningful
> names and extensions.

The Linux kernel just doesn't care, and shouldn't.

It's acceptable for Guile to create a higher-level illusion, but it
shouldn't sacrifice completeness while doing so. You should be able to
manipulate every conceivable filename from Guile code.

(Python 3.x accepts bytes objects, its counterpart of bytevectors, as
well as strings throughout its file system API. For example, listing a
directory returns strings if the directory name is given as a string.
It returns bytes objects if the directory name is given as a bytes
object. Python's bytes literals accept ASCII text, which makes this
rather convenient.)
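A short sketch of that Python behaviour, using a throwaway temporary
directory. The function name list_both_ways and the file name
example.txt are only illustrative.

```python
import os
import tempfile


def list_both_ways():
    """List one directory twice: by str name and by bytes name."""
    with tempfile.TemporaryDirectory() as directory:
        # Create a single empty file to list.
        with open(os.path.join(directory, "example.txt"), "w"):
            pass
        as_text = os.listdir(directory)                # str in, str out
        as_bytes = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    return as_text, as_bytes


as_text, as_bytes = list_both_ways()
print(as_text, as_bytes)
```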

> If Guile cannot easily work with file names encoded in a codeset other
> than the current locale's one, then Guile should be extended to allow
> a program to tell it in which encoding to interpret a particular name.

A program usually has no clue how a pathname has been encoded.

> (I think Guile already supports that, but maybe I misremember.) But
> lobbying for treating file names as byte streams, let alone Latin-1
> characters, is a large step backwards, to 1990s when we didn't know
> better. We've come a long way since then and learned a lot on the way.

At least our backwardness allowed Linux to jump directly to UTF-8 and
not be afflicted by UCS-2 like Windows and Java.

I'm not saying bytevectors are elegant, but we should not replace them
with wishful thinking. Ideally, we should have a bijective
bytevector-to-string mapping. (Python 3.x uses Unicode surrogate code
points for that purpose but doesn't quite achieve bijection,
unfortunately.)
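As an illustration in Python, the "surrogateescape" error handler
round-trips arbitrary bytes through a string, yet two different strings
can encode to the same bytes. The second half is one way in which the
mapping falls short of a bijection, as far as I can tell; the variable
names are made up.

```python
# Bytes that are not valid UTF-8 survive a trip through str.
raw = b"caf\xe9"
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

# The other direction is not one-to-one: a string built from three
# escape surrogates encodes to the same bytes as a genuine character.
forged = "\udce4\udcbd\udca0"
genuine = "\u4f60"
forged_bytes = forged.encode("utf-8", errors="surrogateescape")
genuine_bytes = genuine.encode("utf-8", errors="surrogateescape")

print(round_trip == raw)
print(forged != genuine and forged_bytes == genuine_bytes)
```

Latin-1, by contrast, is a plain one-to-one mapping between the 256
byte values and the first 256 code points.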

I'm a bit sorry that Guile repeated Python 3's mistake and brought
(Unicode) strings to the center. Strings are a highly special-purpose
data structure; I really never had a real need for them in my decades of
programming. Also, I suspect strings are much too simplistic for any
serious typesetting or GUI work. It seems the sweet spot of strings is
text/plain mail messages and Usenet postings.

Guile 1.x's and Python 2.x's bytevector/string confusion was actually
a very happy medium. Neither the OS nor the programming language placed
any interpretation on the byte sequences. That was left to the
application.


Marko
