Hi Philippe,

Philippe Meunier wrote on Wed, Nov 29, 2017 at 09:11:38AM -0500:

> I've noticed something unexpected when copy-pasting UTF-8 characters in
> xterm: xterm seems to change some of the characters into something
> different but visually similar.  Here's an example (using ksh):
> 
> $ uname -a
> OpenBSD foo.my.domain 6.1 GENERIC#19 i386
> $ ls
> Thérèse

That's a bad idea.  Do not use non-ASCII bytes in file names.
You are in for all kinds of trouble.  Not so much because using
arbitrary bytes in file names would be invalid, but because their
meaning is completely undefined on any UNIX-like operating system.

By definition, file names are byte strings, not character strings.
They do NOT have a meaning in any particular locale and are NOT
representing accented characters.

In this respect, OpenBSD is better than other operating systems.
The problem is mostly hidden on OpenBSD because OpenBSD supports
UTF-8 only.  So if you use UTF-8 characters in file names, you often
get away with it simply because it's the only locale supported by
the system.  But, as you see, even on OpenBSD, you do not always
get away with such recklessness.

On other systems supporting different locales, each user can choose
their own locale, so one user may have UTF-8 set, another one
ISO-LATIN-something, and yet another one Shift-JIS.  But there is
only one file system.  So every filename will be gibberish for all
users except for the one user having a locale where it happens to
be validly encoded.
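
To make that concrete, here is a minimal sketch in Python (chosen only
for illustration).  The byte string below is an assumption: it is what
a UTF-8 user would get when typing the precomposed form of the name.
The three decodings stand in for three users with different locales
looking at the same directory entry.

```python
# Hypothetical file name bytes, as stored by a user with a UTF-8 locale.
name = b"Th\xc3\xa9r\xc3\xa8se"

# The same bytes as seen by users with three different locales.
utf8_view = name.decode("utf-8")
latin1_view = name.decode("latin-1")
# errors="replace" so that undecodable bytes would not raise an exception.
sjis_view = name.decode("shift_jis", errors="replace")

for label, view in (("UTF-8", utf8_view),
                    ("ISO-8859-1", latin1_view),
                    ("Shift-JIS", sjis_view)):
    print(label, repr(view))
```

Only the first view shows what the creator of the file intended; the
bytes on disk are the same in all three cases.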

Repeat after me:  A file system does not have a locale.  Non-ASCII
characters cannot be encoded in file names, on any UNIX in general.
(Windows is different, but at the price of badly violating POSIX
in significant parts of its C library).

> $ ls | od -c
> 0000000    T   h   e 314 201   r   e 314 200   s   e  \n                
> 0000014
> $ cp Thérèse Thérèse
> 
> This copy command is typed as follows: type 'cp ', press tab for ksh to
> auto-complete the first filename, another space, then use the mouse to
> copy-paste the first filename into xterm to get the second filename.
> The cp command works without any error.  The result is:

   $ printf "e\xcc\x81" | uniname
  character  byte       UTF-32   encoded as     glyph   name
          0          0  000065   65             e      LATIN SMALL LETTER E
          1          1  000301   CC 81                 COMBINING ACUTE ACCENT
   $ printf "\xc3\xa9" | uniname 
  character  byte       UTF-32   encoded as     glyph   name
          0          0  0000E9   C3 A9 \
  LATIN SMALL LETTER E WITH ACUTE

That's called "canonical composition" in Unicode.
The UTF-8 multibyte character sequences "e\xcc\x81" and "\xc3\xa9"
are canonically equivalent, which means that multibyte-character
aware software is required to treat both identically, and such
software is allowed to silently substitute one for the other.

Of course, the file system is not multibyte-character aware and not
allowed to be, so as a file name, both names are different.
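
A small sketch of the difference, again in Python purely as an
illustration: the two byte sequences are the ones from the uniname
output above, and unicodedata.normalize() from the Python standard
library stands in for whatever multibyte-aware software does the
substitution.  Nothing here claims to be what xterm or the X libraries
actually do.

```python
import unicodedata

# Byte sequences from the uniname output above.
decomposed = b"e\xcc\x81"    # U+0065 followed by U+0301
precomposed = b"\xc3\xa9"    # U+00E9

# As file names, these are simply two different byte strings.
print(decomposed == precomposed)

# Character-aware software may normalize in either direction,
# after which the two compare equal.
nfc = unicodedata.normalize("NFC", decomposed.decode("utf-8"))
nfd = unicodedata.normalize("NFD", precomposed.decode("utf-8"))
print(nfc.encode("utf-8") == precomposed)
print(nfd.encode("utf-8") == decomposed)
```

The first comparison is false, the other two are true: equivalent as
Unicode text, distinct as bytes, and the file system only sees bytes.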

Yes, you heard correctly: Not only can filenames containing
*semantically different* Unicode characters have identical visual
representation, but the filesystem is also required to treat
filenames with *identical* Unicode semantics as different.
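
The first half of that statement can be sketched as follows.  The
second name is a made-up example, not something from your report: its
last letter is a Cyrillic character that most fonts draw exactly like
a Latin "e".  Python is used only for illustration.

```python
import unicodedata

latin = "Th\u00e9r\u00e8se"        # all Latin letters
mixed = "Th\u00e9r\u00e8s\u0435"   # hypothetical: last letter is U+0435

# The two final letters look alike but are different characters.
print(unicodedata.name(latin[-1]))
print(unicodedata.name(mixed[-1]))

# Normalization does not unify them; they are not canonically equivalent.
same_after_nfc = (unicodedata.normalize("NFC", latin) ==
                  unicodedata.normalize("NFC", mixed))
print(same_after_nfc)

# The file system would store two different byte strings.
print(latin.encode("utf-8"))
print(mixed.encode("utf-8"))
```

In a directory listing, both names would typically look the same to
the user, yet they refer to two different files.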

Do not use Unicode for filenames.  It simply doesn't work and is
a security nightmare on top of that.

The reason for UTF-8 support in ls(1) isn't to encourage UTF-8
filenames.  It is merely a crutch helping to display as much
information as possible about broken file systems.  They are still
broken and dangerous.

> So it looks like xterm is changing

I'm not convinced it is xterm; it might also be the X libraries
supporting copying with the mouse.  Anyway, whatever does it is
allowed to.

It's certainly not ksh(1), because our ksh deliberately has only
limited multibyte-character support and is not fully
multibyte-character aware.  We want predictable, not surprising
behaviour in the shell.  In particular, our ksh never changes byte
sequences.

Yours,
  Ingo
