Follow-up Comment #12, bug #65108 (group groff):

[comment #11 comment #11:]
> [comment #0 original submission:]
> > we have no way of knowing what the file system's character encoding is.
> > Might be ISO 8859-1, UTF-8, UTF-16BE/LE, or something else entirely.
>
> I'm not sure now if that's a meaningful question. The file system seems to
> just store a string of bytes as the file name, and leave it up to the shell
> how to interpret that.
> $ mkdir foo
> $ cd foo
> $ echo résumé | iconv -tutf8 | xargs touch
> $ echo résumé | iconv -tlatin1 | xargs touch
> $ echo * | od -c
> 0000000 r 303 251 s u m 303 251 r 351 s u m 351 \n
> 0000020
> Then a UTF-8 shell produces:
> $ ls
> résumé 'r'$'\351''sum'$'\351'
> and a Latin-1 shell produces:
> $ ls
> rÃ©sumÃ© résumé
> That is, both filenames are valid (but different) strings of Latin-1
> characters. In UTF-8, one of them is a string of valid characters, and one
> has two invalid bytes in it.
It's also valid Latin-2, Latin-5, Latin-9, and KOI8-R, to name four other
encodings supported by _groff_.
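To make the point concrete, here is a minimal sketch (not anything _groff_ does today) that takes the two byte strings from the `od` output above and tries to decode each under several encodings. The helper name `interpret` and the choice of Python codec names are my own assumptions for illustration.

```python
# The two file names exactly as the file system stored them (from `od -c`).
UTF8_NAME = b"r\xc3\xa9sum\xc3\xa9"
LATIN1_NAME = b"r\xe9sum\xe9"

# Python codec names assumed to correspond to UTF-8, Latin-1, Latin-2,
# Latin-5, Latin-9, and KOI8-R respectively.
ENCODINGS = ["utf-8", "latin-1", "iso8859-2", "iso8859-9", "iso8859-15", "koi8-r"]


def interpret(raw, encoding):
    """Return the text `raw` spells in `encoding`, or None if it is invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for enc in ENCODINGS:
        print(enc, repr(interpret(UTF8_NAME, enc)), repr(interpret(LATIN1_NAME, enc)))
```

Only the UTF-8 interpretation of the second name fails; every single-byte encoding accepts both names, though not all of them agree on which characters the bytes represent.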
> This is an ext4 file system, but I would imagine any other Unix-based one
> would have to work the same in order to interact with shells consistently.
I feel like we're saying the same thing, or compatible things.
A file named "résumé1.ms" might be stored on the file system using either
character encoding, or, on a Windows system, using UTF-16LE. A _groff_ user
with a document that wants to `so` that file name:
$ grep -F .so résumé.ms
.so résumé1.ms
.so résumé2.ms
.so résumé3.ms
...is going to need either an encoding match between résumé.ms's contents
and their file system, or some sophistication about character encodings.
That's why I want to be able to support:
$ grep -F .so résumé.ms
.so r\[u00E9]sum\[u00E9]1.ms
.so r\[u00E9]sum\[u00E9]2.ms
.so r\[u00E9]sum\[u00E9]3.ms
That way a person doesn't have to _preconv_ their document.
Or perhaps they *did* _preconv_ their document, and this is what the program
left them with, because that tool has no sense of context regarding requests
that take file name arguments: `so`, `soquiet`, `mso`, `msoquiet`, `open`,
`opena`, `psbb`, `cf`, `fp`, `hpf`, `hpfa`, `nx`, or `trf`.
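As a rough illustration of what that support could look like, here is a sketch under stated assumptions, not _groff_'s actual implementation. The function names `expand_unicode_escapes` and `file_name_bytes` are hypothetical, the file system's encoding is assumed to be supplied by the caller, and only the simple `\[uXXXX]` form is handled (composite forms are ignored).

```python
import re

# Matches the simple \[uXXXX] form: 'u' followed by 4 to 6 uppercase hex
# digits. Composite escapes (several code points in one bracket) are out of
# scope for this sketch.
_ESCAPE = re.compile(r"\\\[u([0-9A-F]{4,6})\]")


def expand_unicode_escapes(name):
    """Replace each \\[uXXXX] escape in a file name argument with its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def file_name_bytes(name, fs_encoding):
    """Encode a file name argument as the bytes to look up on the file system.

    `fs_encoding` is an assumption supplied by the caller; nothing here can
    discover what encoding the file names on disk actually use.
    """
    return expand_unicode_escapes(name).encode(fs_encoding)


if __name__ == "__main__":
    arg = r"r\[u00E9]sum\[u00E9]1.ms"
    for enc in ("utf-8", "latin-1", "utf-16-le"):
        print(enc, file_name_bytes(arg, enc))
```

The same `so` argument then yields different byte strings depending on the assumed encoding, which is the part the user (or a configuration setting) would still have to get right.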
I feel like we might be talking past each other...?
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?65108>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
