Re: xterm(1) changing UTF-8 characters when copy-pasting?

Ingo Schwarze Wed, 29 Nov 2017 10:05:59 -0800

Hi Anthony,

Anthony J. Bentley wrote on Wed, Nov 29, 2017 at 10:29:28AM -0700:
> Ingo Schwarze writes:


>> That's a bad idea.  Do not use non-ASCII bytes in file names.
>> You are in for all kinds of trouble.

> I don't agree. In a situation where a single user will be accessing
> files,

That's a very strong condition, which will rarely hold.  But sure,
when it does hold, and when the number of files is too large to
assign sensible file names, it partially mitigates the problems.
But only partially.

> you can use whatever naming scheme you like. UTF-8 works exactly
> how you would expect: the filename you enter is the filename you'll
> get.

Until some program from ports decides to legitimately do Unicode
normalization, uses buggy built-in locale components, assumes the
wrong locale, or incorrectly validates character encoding and crashes
or truncates data.  Those are just a few examples of what can still go
wrong even on a purely single-user system.  All of these are fairly
widespread in the wild.  Quite certainly, xterm is not the only
program doing normalization, and i have rarely seen any program
that is not buggy with respect to multibyte-character handling.
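
To make the normalization problem concrete, here is a minimal sketch
in Python (chosen only for brevity; the variable names are made up for
illustration).  It assumes a file system that compares names byte for
byte, and shows that the decomposed and the precomposed spelling of
the same visible character are different byte strings, so a name that
went through normalization no longer matches the one on disk:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: the decomposed spelling.
decomposed = "e\u0301"

# What a program normalizing to Normalization Form C hands back instead.
precomposed = unicodedata.normalize("NFC", decomposed)

# Both look the same on screen, but the UTF-8 byte sequences differ.
decomposed_bytes = decomposed.encode("utf-8")    # two code points, three bytes
precomposed_bytes = precomposed.encode("utf-8")  # one code point, two bytes

# A byte-wise comparison of the two file names fails.
names_match = decomposed_bytes == precomposed_bytes

print(decomposed_bytes, precomposed_bytes, names_match)
```

If a file was created under the first name and the second name is what
comes out of a copy-paste, opening it by the pasted name will not find
the file.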

> Misencoded files can also exist, with exactly the results you would
> expect also: you can't necessarily type it, but if you can pass the
> exact filename, programs will work.

Except those using fgetws(3), mbtowc(3), mbstowcs(3), and friends
for reading UTF-8 data and terminating on encoding errors, which
includes for example almost all of the FreeBSD base system, including
POSIX utilities like cut(1).
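
As a rough analogy only (this is Python's strict UTF-8 decoder, not
the C functions named above, and the helper names are hypothetical),
the following sketch shows how a reader that refuses invalid input
handles a well-formed and a misencoded file name differently:

```python
def decode_name(raw: bytes) -> str:
    """Decode strictly; invalid UTF-8 raises UnicodeDecodeError."""
    return raw.decode("utf-8")


def try_decode(raw: bytes):
    """Return the decoded name, or None if the bytes are not valid UTF-8."""
    try:
        return decode_name(raw)
    except UnicodeDecodeError:
        return None


# The same name, once encoded as UTF-8 and once as ISO-8859-1.
valid_name = "caf\u00e9.txt".encode("utf-8")
misencoded_name = "caf\u00e9.txt".encode("latin-1")  # contains a lone 0xe9

print(try_decode(valid_name))       # decodes fine
print(try_decode(misencoded_name))  # rejected by the strict decoder
```

A program built this way cannot process the misencoded name at all,
even when it is handed the exact bytes.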

[...]
> This is indeed xterm's fault.
> 
>   precompose (class Precompose)
>     Tells xterm whether to precompose UTF-8 data into Normalization
>     Form C, which combines commonly-used accents onto base
>     characters.  If it does not do this, accents are left as
>     separate characters.  The default is "true".
> 
> In my opinion, that's a *very* poor default. I don't expect base tools
> to canonicalize text like that.

Base tools certainly shouldn't.  In my opinion, it would be an
improvement if Xenocara didn't either, in particular in much-used tools
like xterm(1), even if that causes us to diverge a bit from upstream.

> The only unexpected thing here is xterm doing these transformations
> without asking.

I think i would support a diff to fix that near the end of

  /usr/X11R6/share/X11/app-defaults/XTerm  ==
  /usr/xenocara/app/xterm/XTerm.ad
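
Independent of any change to the shipped defaults, a single user can
presumably override the resource locally.  The following is only a
sketch: the resource name is taken from the manual excerpt quoted
above, while the file used (~/.Xdefaults here) and the loose
"XTerm*" binding are assumptions about a typical setup:

```
! Assumed per-user X resources file, for example ~/.Xdefaults.
! Ask xterm not to precompose input into Normalization Form C.
XTerm*precompose: false
```

Whether the setting takes effect immediately depends on how resources
are loaded in the session; a newly started xterm is the safest test.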

Thanks for digging up the root cause of the OP's issue.

Yours,
  Ingo
