Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Ingo Schwarze
Hi Philippe,

Philippe Meunier wrote on Wed, Nov 29, 2017 at 09:11:38AM -0500:

> I've noticed something unexpected when copy-pasting UTF-8 characters in
> xterm: xterm seems to change some of the characters into something
> different but visually similar.  Here's an example (using ksh):
> 
> $ uname -a
> OpenBSD foo.my.domain 6.1 GENERIC#19 i386
> $ ls
> Thérèse

That's a bad idea.  Do not use non-ASCII bytes in file names.
You are in for all kinds of trouble.  Not so much because using
arbitrary bytes in file names would be invalid, but because their
meaning is completely undefined on any UNIX-like operating system.

By definition, file names are byte strings, not character strings.
They do NOT have a meaning in any particular locale and are NOT
representing accented characters.

In this respect, OpenBSD is better than other operating systems.
The problem is mostly hidden on OpenBSD because OpenBSD supports
UTF-8 only.  So if you use UTF-8 characters in file names, you often
get away with it simply because it's the only locale supported by
the system.  But, as you see, even on OpenBSD, you do not always
get away with such recklessness.

On other systems supporting different locales, each user can choose
their own locale, so one user may have UTF-8 set, another one
ISO-LATIN-something, and yet another one Shift-JIS.  But there is
only one file system.  So every filename will be gibberish for all
users except for the one user having a locale where it happens to
be validly encoded.
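
A minimal sketch of that effect, assuming Python 3 and its built-in codecs (the variable names are made up for illustration): the same stored bytes are read as different text depending on which encoding the reader assumes.

```python
# The bytes a UTF-8 user would store for the name "Thérèse" (precomposed form).
name_bytes = "Th\u00e9r\u00e8se".encode("utf-8")

# Three users with three different locales look at the same directory entry.
as_utf8 = name_bytes.decode("utf-8")
as_latin1 = name_bytes.decode("latin-1")
as_shift_jis = name_bytes.decode("shift_jis", errors="replace")

print(name_bytes)
print(as_utf8)
print(as_latin1)
print(as_shift_jis)
```

Only the first decoding gives the intended name; the file system itself stores nothing that says which one is right.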

Repeat after me:  A file system does not have a locale.  Non-ASCII
characters cannot be encoded in file names, on any UNIX in general.
(Windows is different, but at the price of badly violating POSIX
in significant parts of its C library).

> $ ls | od -c
> 0000000    T   h   e 314 201   r   e 314 200   s   e  \n
> 0000014
> $ cp Thérèse Thérèse
> 
> This copy command is typed as follows: type 'cp ', press tab for ksh to
> auto-complete the first filename, another space, then use the mouse to
> copy-paste the first filename into xterm to get the second filename.
> The cp command works without any error.  The result is:

   $ printf "e\xcc\x81" | uniname
  character  byte  UTF-32  encoded as  glyph  name
          0     0  000065  65          e      LATIN SMALL LETTER E
          1     1  000301  CC 81              COMBINING ACUTE ACCENT
   $ printf "\xc3\xa9" | uniname
  character  byte  UTF-32  encoded as  glyph  name
          0     0  0000E9  C3 A9       é      LATIN SMALL LETTER E WITH ACUTE

That's called "canonical composition" in Unicode.
The UTF-8 multibyte character sequences "e\xcc\x81" and "\xc3\xa9"
are canonically equivalent, which means that multibyte-character
aware software is required to treat both identically, and such
software is allowed to silently substitute one for the other.

Of course, the file system is not multibyte-character aware and not
allowed to be, so as a file name, both names are different.
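
The distinction can be checked with a short sketch, assuming Python 3 and its standard unicodedata module (the variable names are only illustrative): the two spellings differ as byte strings but become equal once normalized.

```python
import unicodedata

decomposed = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# What a file system sees: two different byte strings.
decomposed_bytes = decomposed.encode("utf-8")     # b'e\xcc\x81'
precomposed_bytes = precomposed.encode("utf-8")   # b'\xc3\xa9'
bytes_equal = decomposed_bytes == precomposed_bytes

# What Unicode-aware software may see: canonically equivalent text.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(bytes_equal, nfc_equal, nfd_equal)
```

Normalizing to NFC is canonical composition; NFD is the reverse direction.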

Yes, you heard correctly: Not only can filenames containing
*semantically different* Unicode characters have identical visual
representation, but the filesystem is also required to treat filenames
as different that have *identical* semantics in Unicode.

Do not use Unicode for filenames.  It simply doesn't work and is
a security nightmare on top of that.

The reason for UTF-8 support in ls(1) isn't to encourage UTF-8
filenames.  It is merely a crutch helping to display as much
information as possible about broken file systems.  They are still
broken and dangerous.

> So it looks like xterm is changing

I'm not convinced it is xterm; it might also be the X libraries
supporting copying with the mouse.  Anyway, whatever does it is
allowed to.

It's certainly not ksh(1) because our ksh is not fully multibyte-
character aware on purpose, but deliberately has only limited
multibyte-character support.  We want predictable, not surprising
behaviour in the shell.  In particular, our ksh never changes byte
sequences.

Yours,
  Ingo



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Philippe Meunier
Ingo Schwarze wrote:
>Philippe Meunier wrote:
>> $ ls
>> Thérèse
>
>That's a bad idea.  Do not use non-ASCII bytes in file names.

That's a nice thought but in practice I have some files on that machine
with names written in French, Thai, Chinese, Korean, and Japanese, and for
some of these files renaming is not an option for work reasons.  I somehow
doubt that I'm the only one in such a situation.

>In this respect, OpenBSD is better than other operating systems.
>The problem is mostly hidden on OpenBSD because OpenBSD supports
>UTF-8 only.

Yes, I've noticed that the UTF-8 support in OpenBSD has become much nicer
in recent years.  My thanks to the devs who did that :-)

>That's called "canonical composition" in Unicode.

*sigh*  I see.  Well, I learned something new today.  Thanks for the info.

>It's certainly not ksh(1) because our ksh is not fully multibyte-
>character aware on purpose, but deliberately has only limited
>multibyte-character support.

Actually, since you brought this up, I wish ksh had fuller multibyte
character support.  As you say above the problem is mostly hidden and most
of the time it happens to just work, but, for example, trying to delete
double-wide Korean characters (well, syllables, really, which are *all*
double-wide) messes up the command line: the double-wide characters are
correctly deleted but the cursor moves left by only one position for each
delete which means that very quickly I lose track of which characters I'm
actually deleting and I'm forced to redraw the line.  Anyway, at this point
it's mostly anecdotal; most things work out of the box.

Philippe




Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Ingo Schwarze
Hi Philippe,

Philippe Meunier wrote on Wed, Nov 29, 2017 at 11:35:59AM -0500:
> Ingo Schwarze wrote:
>> Philippe Meunier wrote:

>>> $ ls
>>> Thérèse

>> That's a bad idea.  Do not use non-ASCII bytes in file names.

> That's a nice thought but in practice I have some files on that machine
> with names written in French, Thai, Chinese, Korean, and Japanese, and for
> some of these files renaming is not an option for work reasons.  I somehow
> doubt that I'm the only one in such a situation.

Sure.  In some situations, there is no viable alternative to dealing
with file systems containing broken filenames.  That's why we try
to make tools like ls(1) as useful as possible in such a bad
situation.  But you can never expect a smooth user experience.
It is not an OpenBSD-specific problem; in fact, it's worse almost
everywhere else, although not everybody is likely to admit that.

>> It's certainly not ksh(1) because our ksh is not fully multibyte-
>> character aware on purpose, but deliberately has only limited
>> multibyte-character support.

> Actually, since you brought this up, I wish ksh had fuller multibyte
> character support.  As you say above the problem is mostly hidden and most
> of the time it happens to just work, but, for example, trying to delete
> double-wide Korean characters (well, syllables, really, which are *all*
> double-wide) messes up the command line:

That is indeed expected, and it is one of the things that are very
unlikely to change even in the long term.  Adding support for
correctly handling character display widths in shell command line
editing would require calling functions like mbtowc(3) and wcwidth(3)
on the fly in the command line editing modules.  Such changes would
be fairly intrusive and carry a substantial risk of introducing
nasty, perhaps even security-relevant bugs into the shell, so even
if somebody would cook up patches, i'm not convinced that they could
go in.
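
To illustrate why byte counts, character counts, and display columns all differ, here is a rough sketch in Python 3.  It only approximates what wcwidth(3) does, using the standard unicodedata module; display_width is a hypothetical helper, not anything taken from ksh.

```python
import unicodedata

def display_width(text):
    """Approximate the number of terminal columns needed for text."""
    columns = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue                  # combining marks, format characters: 0 columns
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2              # wide and fullwidth characters: 2 columns
        else:
            columns += 1              # everything else is assumed to take 1 column
    return columns

hangul = "\ud55c\uae00"               # two Hangul syllables
print(len(hangul), len(hangul.encode("utf-8")), display_width(hangul))
```

For the two syllables above this gives 2 characters, 6 bytes, and 4 columns, so a line editor that moves the cursor one column per deleted character ends up out of step with the display.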

That said, i see that you are actually torturing our shell in these
respects quite a bit.  As long as you don't expect that everything
can be fixed, you are quite welcome to report issues that you see.
I don't doubt that there are still outright bugs, and it also seems
likely that there are missing features which can be implemented
without making a mess of the shell.  So reports based on real everyday
use are definitely helpful.  While several developers understand
the basics of how multibyte character support works in the shell
and in some others of our POSIX utilities, very few use that heavily,
as far as i know.

Yours,
  Ingo



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Anthony J. Bentley
Ingo Schwarze writes:
> That's a bad idea.  Do not use non-ASCII bytes in file names.
> You are in for all kinds of trouble.

I don't agree. In a situation where a single user will be accessing
files, you can use whatever naming scheme you like. UTF-8 works exactly
how you would expect: the filename you enter is the filename you'll get.

Misencoded files can also exist, with exactly the results you would
expect also: you can't necessarily type it, but if you can pass the
exact filename, programs will work. Same goes with control characters
like backspaces in file names (far more annoying than UTF-8).

Saying you can't is impractical. Anyone downloading lots of external
files through web browsers, torrent clients, or any number of other
programs in ports will eventually encounter files with UTF-8 filenames.
They work just fine. Keeping spaces out of filenames is already a lost
battle, let alone limiting them to the POSIX portable filename character
set (A-Za-z0-9._-).

Obviously once you start talking about files on external media or
otherwise accessible by users in other locales, that conclusion changes.
But I'm talking about a personal desktop here.

> > So it looks like xterm is changing
>
> I'm not convinced it is xterm; it might also be the X libraries
> supporting copying with the mouse.  Anyway, whatever does it is
> allowed to.

This is indeed xterm's fault.

   precompose (class Precompose)
   Tells xterm whether to precompose UTF-8 data into Normalization
   Form C, which combines commonly-used accents onto base
   characters.  If it does not do this, accents are left as
   separate characters.  The default is "true".

In my opinion, that's a *very* poor default. I don't expect base tools
to canonicalize text like that.

UTF-8 strings work fine when passed to grep(1), but grep doesn't -- nor
would I expect it to -- canonicalize strings, or ignore zero-width
no-break spaces in running text, or any other sort of weird
transformation invented by the Unicode committee.

The only unexpected thing here is xterm doing these transformations
without asking.

-- 
Anthony J. Bentley



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Ingo Schwarze
Hi Anthony,

Anthony J. Bentley wrote on Wed, Nov 29, 2017 at 10:29:28AM -0700:
> Ingo Schwarze writes:

>> That's a bad idea.  Do not use non-ASCII bytes in file names.
>> You are in for all kinds of trouble.

> I don't agree. In a situation where a single user will be accessing
> files,

That's a very strong condition, which will rarely hold.  But sure,
when it does hold, and when the number of files is too large to
assign sensible file names, it partially mitigates the problems.
But only partially.

> you can use whatever naming scheme you like. UTF-8 works exactly
> how you would expect: the filename you enter is the filename you'll
> get.

Until some program from ports decides to legitimately do Unicode
normalization, uses buggy built-in locale components, assumes the
wrong locale, or incorrectly validates character encoding and crashes
or truncates data.  Just as a few examples of what can still go
wrong even on a purely single-user system.  All these are fairly
widespread in the wild.  Quite certainly, xterm is not the only
program doing normalization, and i have rarely seen any program
that is not buggy with respect to multibyte-character handling.

> Misencoded files can also exist, with exactly the results you would
> expect also: you can't necessarily type it, but if you can pass the
> exact filename, programs will work.

Except those using fgetws(3), mbtowc(3), mbstowcs(3), and friends
for reading UTF-8 data and terminating on encoding errors, which
includes for example almost all of the FreeBSD base system, including
POSIX utilities like cut(1).
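
As an analogy only (this is Python 3, not the C functions named above, and not FreeBSD's code): a strict UTF-8 decoder gives up on a misencoded name, while a decoder that carries the raw bytes through can still hand back the exact file name.

```python
# An ISO-8859-1 encoded name; these bytes are not valid UTF-8.
raw = b"Th\xe9r\xe8se\n"

# Strict decoding stops at the first invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogateescape" smuggles the invalid bytes through unchanged,
# so the original byte string can be recovered exactly.
tolerant = raw.decode("utf-8", errors="surrogateescape")
round_trip = tolerant.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_trip == raw)
```

A program written in the first style loses data on such input; one written in the second style keeps treating the name as the byte string it is.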

[...]
> This is indeed xterm's fault.
> 
>   precompose (class Precompose)
> Tells xterm whether to precompose UTF-8 data into Normalization
> Form C, which combines commonly-used accents onto base
> characters.  If it does not do this, accents are left as
> separate characters.  The default is "true".
> 
> In my opinion, that's a *very* poor default. I don't expect base tools
> to canonicalize text like that.

Base tools certainly shouldn't.  In my opinion, if Xenocara wouldn't,
that would be an improvement, too.  In particular in much-used tools
like xterm(1).  Even if that causes us to diverge a bit from upstream.

> The only unexpected thing here is xterm doing these transformations
> without asking.

I think i would support a diff to fix that near the end of

  /usr/X11R6/share/X11/app-defaults/XTerm  ==
  /usr/xenocara/app/xterm/XTerm.ad

Thanks for digging up the root cause of the OP's issue.

Yours,
  Ingo



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Stefan Sperling
On Wed, Nov 29, 2017 at 07:05:05PM +0100, Ingo Schwarze wrote:
> Anthony J. Bentley wrote on Wed, Nov 29, 2017 at 10:29:28AM -0700:
> > The only unexpected thing here is xterm doing these transformations
> > without asking.
> 
> I think i would support a diff to fix that

Seconded. The current default behaviour is broken.



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-29 Thread Philippe Meunier
Anthony J. Bentley wrote:
>   precompose (class Precompose)

Thanks!  That makes xterm work (almost) as expected:

$ ls
Thérèse
$ ls | od -c
0000000    T   h   e 314 201   r   e 314 200   s   e  \n
0000014
$ cp Thérèse Thérèse
cp: Thérèse and Thérèse are identical (not copied).

The first filename in the cp command above is created using ksh's
auto-completion and the second filename is created by copy-pasting
the first filename.  So xterm doesn't recompose the characters anymore.

The strange part is that, when I copy the first filename and paste
it to become the second filename, the second filename is shown without
any accent, even though the first and second filenames are now the exact
same sequence of bytes (I checked using od(1)).  So on the command line
it actually looks like this:

$ cp Thérèse Therese
cp: Thérèse and Thérèse are identical (not copied).

which looks wrong but works as expected.  I tried to play with various
things like the allowPasteControls resource but to no avail.  It looks
like an xterm bug to me but at this point I'm not even sure of that...
Does anyone have a clue?

Thanks,

Philippe





Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Anthony J. Bentley
Philippe Meunier writes:
> The strange part is that, when I copy the first filename and paste
> it to become the second filename, the second filename is shown without
> any accent, even though the first and second filenames are now the exact
> same sequence of bytes (I checked using od(1)).  So on the command line
> it actually looks like this:
>
> $ cp Thérèse Therese
> cp: Thérèse and Thérèse are identical (not copied).
>
> which looks wrong but works as expected.  I tried to play with various
> things like the allowPasteControls resource but to no avail.  It looks
> like an xterm bug to me but at this point I'm not even sure of that...
> Does anyone have a clue?

I get the same result, but only when using TrueType fonts (default or no).
If I Ctrl-rightclick and uncheck "TrueType Fonts", the accents show up.
So it looks like xterm's rendering of combining characters is broken, or
unimplemented.



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Philippe Meunier
Anthony J. Bentley wrote:
>I get the same result, but only when using TrueType fonts (default or no).

If I use TrueType fonts:

$ printf "e\xcc\x81\n"

only shows the letter 'e', and when I try to copy-paste it I get a letter
'e' followed by a question mark inside a circle.  If I then redraw the line
I get an 'e' by itself but od(1) shows that it is still e\xcc\x81.

Using TrueType fonts:

$ printf "\xc3\xa9\n"

works fine and I can copy-paste the accented 'e' without problem.



Without TrueType fonts:

$ printf "e\xcc\x81\n"

works fine but when I try to copy-paste the accented 'e' I get a letter 'e'
followed by a question mark inside a circle.  If I then redraw the line I
get the correct accented 'e' again (which od(1) shows is still e\xcc\x81).

Without TrueType fonts:

$ printf "\xc3\xa9\n"

works fine and I can copy-paste the accented 'e' without problem.



So there seem to be two problems:

- Copy-pasting the result of printf "e\xcc\x81\n" never works correctly in
  xterm, regardless of whether I use TrueType fonts or not.  xterm
  copy-pastes the correct sequence of bytes but that sequence is not
  displayed correctly.  That's the same problem I noticed in my previous
  email.

- When using TrueType fonts, printf "e\xcc\x81\n" does not show the accent.

On a note related to this second problem, I never use TrueType fonts in
xterm anyway because then xterm can't display Thai or Chinese or Korean
characters (at least with the default font; I haven't tried to use any
other font).  So I suspect that this second problem is more a font problem
than an xterm bug.

Here's my current config:

$ xrdb -query
xterm*background:   black
xterm*foreground:   white
xterm*metaSendsEscape:  true
xterm*multiScroll:  true
xterm*precompose:   false
xterm*saveLines:    256
xterm*scrollBar:    true
xterm*scrollKey:    true
xterm*scrollTtyOutput:  false
xterm*utf8Title:true
xterm*utmpInhibit:  true
xterm*visualBell:   true

and:

$ set | egrep -i utf
LC_CTYPE=en_US.UTF-8
XTERM_LOCALE=en_US.UTF-8

Philippe




Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Allan Streib
Philippe Meunier  writes:

> So there seem to be two problems:
>
> - Copy-pasting the result of printf "e\xcc\x81\n" never works correctly in
>   xterm, regardless of whether I use TrueType fonts or not.  xterm
>   copy-pastes the correct sequence of bytes but that sequence is not
>   displayed correctly.  That's the same problem I noticed in my previous
>   email.
>
> - When using TrueType fonts, printf "e\xcc\x81\n" does not show the accent.

Are you using xterm(1) or uxterm(1)?

When I start uxterm I don't see these behaviors. I see the correct
accented e in all cases.

Allan



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Philippe Meunier
Allan Streib wrote:
>Are you using xterm(1) or uxterm(1)?

uxterm does not exist anymore on OpenBSD 6.1:
https://www.openbsd.org/faq/upgrade61.html

Philippe




Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Allan Streib
Philippe Meunier  writes:

> Allan Streib wrote:
>>Are you using xterm(1) or uxterm(1)?
>
> uxterm does not exist anymore on OpenBSD 6.1:
> https://www.openbsd.org/faq/upgrade61.html

Hm. Well that's one that I overlooked. I've been upgrading since 5.x and
I never removed uxterm. I'm on 6.2 now and still using it.

Allan



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Ingo Schwarze
Hi,

Allan Streib wrote on Thu, Nov 30, 2017 at 12:09:13PM -0500:
> Philippe Meunier  writes:
>> Allan Streib wrote:

>>> Are you using xterm(1) or uxterm(1)?

>> uxterm does not exist anymore on OpenBSD 6.1:
>> https://www.openbsd.org/faq/upgrade61.html

> Hm. Well that's one that I overlooked. I've been upgrading since 5.x
> and I never removed uxterm. I'm on 6.2 now and still using it.

It's a trivial but wordy wrapper script.  The only things it does
that i could imagine to be relevant are setting two command line
options: -class UXTerm and -en UTF-8.

The -en option is a deprecated way to hardcode UTF-8 mode for
systems that do not support setlocale(3), so don't use it.
It can't be what helps you here, as UTF-8 works in general.

The -class UXTerm option causes /usr/X11R6/share/X11/app-defaults/UXTerm
to be used instead of /usr/X11R6/share/X11/app-defaults/XTerm.
The UXTerm file was also deleted, as it contains only font stuff
and nobody considered that relevant for anything.

Does the following make things work better for you?
You can apply it directly to /usr/X11R6/share/X11/app-defaults/XTerm
if you want to.  It just copies the UXTerm.ad stuff over and disables
the Precompose resource.  Frankly, i don't have the slightest idea
what the font resources mean, not even after reading the comment
in UXTerm.ad, but maybe they are needed for some reason.

Except in a professional typesetting system like groff or LaTeX, i
consider anything that makes the end user worry about fonts
fundamentally broken.  Fonts that work should be installed by default
and not configurable, in my opinion.  Toying around with fonts
causes nothing but grief.

Yours,
  Ingo


Index: XTerm.ad
===================================================================
RCS file: /cvs/xenocara/app/xterm/XTerm.ad,v
retrieving revision 1.18
diff -u -p -r1.18 XTerm.ad
--- XTerm.ad	15 Jul 2017 19:20:51 -0000	1.18
+++ XTerm.ad	30 Nov 2017 17:52:26 -0000
@@ -266,6 +266,14 @@
 ! locales.  Even for people using the C/POSIX locale for everything,
 ! that's safer and more usable than the upstream default of "medium".
 *locale: UTF-8
+*precompose: false
+*VT100.utf8: 1
+*VT100.font2: -misc-fixed-medium-r-normal--8-80-75-75-c-50-iso10646-1
+*VT100.font:  -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
+*VT100.font3: -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
+*VT100.font4: -misc-fixed-medium-r-normal--13-120-75-75-c-80-iso10646-1
+*VT100.font5: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
+*VT100.font6: -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1
 
 ! ScrollBar by default
 *scrollBar: true



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-11-30 Thread Anthony J. Bentley
Hi Ingo,

Ingo Schwarze writes:
> Except in a professional typesetting system like groff or LaTeX, i
> consider anything that makes the end user worry about fonts
> fundamentally broken.

I think everybody's in agreement that xterm is broken and wrong here.

> Fonts that work should be installed by default
> and not configurable, in my opinion.  Toying around with fonts
> causes nothing but grief.

You'll need extra fonts once I finish my patch to add situationally
appropriate emoji to all our manpages.

> +*precompose: false

Sure.

> +*VT100.utf8: 1

xterm(1):
This option and the utf8 resource are overridden by the -lc and
-en options and locale resource.

We set the locale resource, so this appears redundant.

> +*VT100.font2: -misc-fixed-medium-r-normal--8-80-75-75-c-50-iso10646-1
> +*VT100.font:  -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
> +*VT100.font3: -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
> +*VT100.font4: -misc-fixed-medium-r-normal--13-120-75-75-c-80-iso10646-1
> +*VT100.font5: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
> +*VT100.font6: -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

These are already the default according to appres(1).

-- 
Anthony J. Bentley



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Ingo Schwarze
Hi Anthony,

Anthony J. Bentley wrote on Thu, Nov 30, 2017 at 11:28:54PM -0700:

> You'll need extra fonts once I finish my patch to add situationally
> appropriate emoji to all our manpages.

I'm looking forward to that.  Don't forget to make them animated,
make the colours fully configurable, and maybe add some nice
background music, a pleasant scent, and touchscreen support.

>> +*precompose: false

> Sure.

On a more serious note, i'll commit that tomorrow then
based on OK bentley@ unless somebody can point out a downside.

>> +*VT100.utf8: 1

> xterm(1):
> This option and the utf8 resource are overridden by the -lc and
> -en options and locale resource.
> 
> We set the locale resource, so this appears redundant.

Sounds convincing, so we don't need that, even though it used to be
in UXTerm.ad.

>> +*VT100.font2: -misc-fixed-medium-r-normal--8-80-75-75-c-50-iso10646-1
>> +*VT100.font:  -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
>> +*VT100.font3: -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
>> +*VT100.font4: -misc-fixed-medium-r-normal--13-120-75-75-c-80-iso10646-1
>> +*VT100.font5: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
>> +*VT100.font6: -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

> These are already the default according to appres(1).

Hum, i don't doubt your analysis.  But now i don't understand why
uxterm(1) works for Allan and plain xterm(1) doesn't...
I mean, what else is there in the old uxterm script that could
possibly make a difference?

Yours,
  Ingo



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Philippe Meunier
Ingo Schwarze wrote:
>Hum, i don't doubt your analysis.  But now i don't understand why
>uxterm(1) works for Allan and plain xterm(1) doesn't...

Re-reading Allan's email, it's not clear to me whether he did his tests
with the precompose resource set to true or false.  If using the default
value of true then:

- Copy-pasting the result of printf "e\xcc\x81\n" works correctly in xterm,
  regardless of whether I use TrueType fonts or not.  That's because, as
  pointed out by Ingo, xterm rewrites e\xcc\x81 into \xc3\xa9.  That's the
  reason why this whole discussion started (and preventing the rewrite is
  then the reason why setting the precompose resource to false makes
  sense).

- When using TrueType fonts, printf "e\xcc\x81\n" shows the accent.  This
  is with the precompose resource set to its default true value.
  Interestingly, when the precompose resource is set to false and TrueType
  fonts are used, the same printf "e\xcc\x81\n" does not show the accent
  (as indicated in one of my previous emails).  So it looks like this
  is not just a font problem after all but another bug (which Anthony
  actually already pointed out in his second email).

So my conclusions so far are:

- Allan probably did his tests with the precompose resource set to its
  default true value.  It's either that or there is some as yet unknown
  extra factor that makes a difference in the results between him and me.

- When the precompose resource is set to false, copy-pasting the result of
  printf "e\xcc\x81\n" never works correctly in xterm, regardless of
  whether I use TrueType fonts or not.  xterm copy-pastes the correct
  sequence of bytes but that sequence is not displayed correctly.  That's a
  bug in xterm.

- In addition, when the precompose resource is set to false and TrueType
  fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even
  before trying to copy-paste it): od(1) shows that the correct sequence of
  bytes is printed but it is displayed without accent.  That's another bug
  in xterm.  The result is displayed correctly when the precompose resource
  is set to true.

Philippe




Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Anthony J. Bentley
Ingo Schwarze writes:
> >> +*precompose: false
>
> > Sure.
>
> On a more serious note, i'll commit that tomorrow then
> based on OK bentley@ unless somebody can point out a downside.

Please update the OPENBSD SPECIFICS section of the manual as well.

> Hum, i don't doubt your analysis.  But now i don't understand why
> uxterm(1) works for Allan and plain xterm(1) doesn't...

Yeah, my guess is he never disabled precomposition for uxterm,
meaning what he's seeing are not actually combining characters,
meaning xterm doesn't bug out.



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Allan Streib
Philippe Meunier  writes:

> - Allan probably did his tests with the precompose resource set to its
>   default true value.

I assume this is correct because I have never deliberately changed it.

And you're right after all.

$ printf "e\xcc\x81\n" | od -a
0000000    e  cc  81  nl

$ printf "e\xcc\x81\n"
é

^ copy/pasting: $ echo "é" | od -a
0000000   c3  a9  nl

Allan



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Anthony J. Bentley
Philippe Meunier writes:
> - When the precompose resource is set to false, copy-pasting the result of
>   printf "e\xcc\x81\n" never works correctly in xterm, regardless of
>   whether I use TrueType fonts or not.  xterm copy-pastes the correct
>   sequence of bytes but that sequence is not displayed correctly.  That's a
>   bug in xterm.

I get slightly different results: with TrueType fonts enabled, LC_CTYPE
set to en_US.UTF-8, and precompose disabled, accents are not displayed,
but they do copy and paste correctly. I tested this on a fresh install as
well as my desktop. I haven't been able to trigger the results you're
getting (best guess: your LC_CTYPE is unset or set funny? But I don't get
the same results even then).

> - In addition, when the precompose resource is set to false and TrueType
>   fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even
>   before trying to copy-paste it): od(1) shows that the correct sequence of
>   bytes is printed but it is displayed without accent.  That's another bug
>   in xterm.  The result is displayed correctly when the precompose resource
>   is set to true.

Yes, this matches what I'm seeing.



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Ingo Schwarze
Hi,

Anthony J. Bentley wrote on Fri, Dec 01, 2017 at 08:18:59AM -0700:
> Philippe Meunier writes:

>> - In addition, when the precompose resource is set to false and TrueType
>>   fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even
>>   before trying to copy-paste it): od(1) shows that the correct sequence of
>>   bytes is printed but it is displayed without accent.  That's another bug
>>   in xterm.  The result is displayed correctly when the precompose resource
>>   is set to true.

> Yes, this matches what I'm seeing.
 
To me, that seems to imply that xterm(1), with the bugs it has now,
works significantly better with Precompose enabled: at least it
displays the correct glyphs, while there seem to be cases where it
displays wrong glyphs without Precompose.  Right?

Doesn't that imply that it would be better to fix this bug first,
before disabling Precompose?  I certainly hate that xterm(1) is
doing normalization by default now, but if removing that exposes a
bug that causes display of incorrect glyphs, that would seem like
a serious regression to me.

What do you think?
  Ingo



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Anthony J. Bentley
Ingo Schwarze writes:
> Hi,
>
> Anthony J. Bentley wrote on Fri, Dec 01, 2017 at 08:18:59AM -0700:
> > Philippe Meunier writes:
>
> >> - In addition, when the precompose resource is set to false and TrueType
> >>   fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even
> >>   before trying to copy-paste it): od(1) shows that the correct sequence of
> >>   bytes is printed but it is displayed without accent.  That's another bug
> >>   in xterm.  The result is displayed correctly when the precompose resource
> >>   is set to true.
>
> > Yes, this matches what I'm seeing.
>  
> To me, that seems to imply that xterm(1), with the bugs it has now,
> works significantly better with Precompose enabled: at least it
> displays the correct glyphs, while there seem to be cases where it
> displays wrong glyphs without Precompose.  Right?
>
> Doesn't that imply that it would be better to fix this bug first,
> before disabling Precompose?  I certainly hate that xterm(1) is
> doing normalization by default now, but if removing that exposes a
> bug that causes display of incorrect glyphs, that would seem like
> a serious regression to me.
>
> What do you think?

I was internally debating this earlier. The bug is already exposed by
any combining characters that don't have precomposed forms. It also
doesn't show up with the default (i.e. non TrueType) fonts. Given that
and how unfriendly the precomposition behavior is, I think disabling it
is still reasonable.
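
To illustrate the point about combining characters without precomposed
forms, here is a minimal Python sketch.  It assumes that Unicode NFC
composition is a fair stand-in for what xterm's precompose resource does;
the thread does not confirm that.  The helper name composes() and the
example sequences are illustrative choices, not anything from xterm.

```python
import unicodedata


def composes(seq: str) -> bool:
    """Return True if NFC composition collapses seq into one code point."""
    return len(unicodedata.normalize("NFC", seq)) == 1


# 'e' + U+0301 COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
e_acute = "e\u0301"

# 'x' + U+0301 has no precomposed form in Unicode, so composition
# leaves it as two code points (assumption: NFC models precompose).
x_acute = "x\u0301"

print(composes(e_acute))
print(composes(x_acute))
```

A sequence like the second one reaches the terminal as base character
plus combining mark whether or not precomposition is enabled, so the
display code for combining marks is exercised either way.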



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Allan Streib


Allan Streib writes:

> $ printf "e\xcc\x81\n" | od -a
> 0000000    e  cc  81  nl
>
> $ printf "e\xcc\x81\n"
> é
>
> ^ copy/pasting: $ echo "é" | od -a
> 0000000   c3  a9  nl

Also in case it's interesting:

$ printf "e\xcc\x81" | xclip -i

$ xclip -o | od -a  
0000000    e  cc  81


$ echo "é" | od -a
0000000    e  cc  81  nl

In the above, the "é" was obtained with middle-click (paste).


$ echo "é" | od -a
0000000   c3  a9  nl

In the above, the entire command 'echo "é" | od -a' was copied from the
prior line and pasted with the mouse.

Allan
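
The two byte sequences in the od(1) output above are the decomposed and
precomposed encodings of the same accented letter.  The following Python
sketch shows the relationship, assuming that standard Unicode
normalization (NFC/NFD) describes the conversion; whether xterm performs
exactly NFC when copying a selection is not established in this thread.
The helper hexbytes() is a hypothetical name used only for this example.

```python
import unicodedata


def hexbytes(s: str) -> str:
    """Render the UTF-8 encoding of s as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))


# What printf "e\xcc\x81" emits: 'e' + U+0301 COMBINING ACUTE ACCENT.
decomposed = b"e\xcc\x81".decode("utf-8")

# What came back after copy/paste of the whole line: U+00E9.
precomposed = b"\xc3\xa9".decode("utf-8")

# NFC composes base letter + combining mark; NFD reverses it.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(hexbytes(decomposed))  # 65 cc 81
print(hexbytes(nfc))         # c3 a9
print(hexbytes(nfd))         # 65 cc 81
```

The two strings render identically but compare unequal byte for byte,
which is why a pasted file name can fail to match the file on disk.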





Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Philippe Meunier
Anthony J. Bentley wrote:
>Philippe Meunier writes:
>> - When the precompose resource is set to false, copy-pasting the result of
>>   printf "e\xcc\x81\n" never works correctly in xterm, regardless of
>>   whether I use TrueType fonts or not.  xterm copy-pastes the correct
>>   sequence of bytes but that sequence is not displayed correctly.  That's a
>>   bug in xterm.
>
>I get slightly different results: with TrueType fonts enabled, LC_CTYPE
>set to en_US.UTF-8, and precompose disabled, accents are not displayed,
>but they do copy and paste correctly. I tested this on a fresh install as
>well as my desktop. I haven't been able to trigger the results you're
>getting (best guess: your LC_CTYPE is unset or set funny? But I don't get
>the same results even then).

Strange.  I have:

$ set | egrep -i 'utf|xterm'
LC_CTYPE=en_US.UTF-8
TERM=xterm
XTERM_LOCALE=en_US.UTF-8
XTERM_SHELL=/bin/ksh
XTERM_VERSION='XTerm/OpenBSD(327)'

and even with just this:

$ xrdb -query
xterm*precompose:   false

and TrueType enabled, then accents are not displayed and copy-paste does
not work: I get an 'e' character followed by another character which is a
question mark inside a circle.

Philippe




Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Philippe Meunier
Anthony J. Bentley wrote:
>I was internally debating this earlier. The bug is already exposed by
>any combining characters that don't have precomposed forms. It also
>doesn't show up with the default (i.e. non TrueType) fonts. Given that
>and how unfriendly the precomposition behavior is, I think disabling it
>is still reasonable.

I'd agree with that.  TrueType fonts are not the default.  I think it's
more important to get copy-paste to work the way one would expect it to
work (even if it displays the characters the wrong way).

Philippe




Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Stefan Sperling
On Fri, Dec 01, 2017 at 12:14:48PM +0100, Ingo Schwarze wrote:
> Hi Anthony,
> 
> Anthony J. Bentley wrote on Thu, Nov 30, 2017 at 11:28:54PM -0700:
> 
> > You'll need extra fonts once I finish my patch to add situationally
> > appropriate emoji to all our manpages.
> 
> I'm looking forward to that.  Don't forget to make them animated,
> make the colours fully configurable, and maybe add some nice
> background music, a pleasant scent, and touchscreen support.

And make them soft and plushy to the touch!



Re: xterm(1) changing UTF-8 characters when copy-pasting?

2017-12-01 Thread Philip Guenther
On Fri, Dec 1, 2017 at 11:38 AM, Stefan Sperling wrote:

> On Fri, Dec 01, 2017 at 12:14:48PM +0100, Ingo Schwarze wrote:
> > Anthony J. Bentley wrote on Thu, Nov 30, 2017 at 11:28:54PM -0700:
> >
> > > You'll need extra fonts once I finish my patch to add situationally
> > > appropriate emoji to all our manpages.
> >
> > I'm looking forward to that.  Don't forget to make them animated,
> > make the colours fully configurable, and maybe add some nice
> > background music, a pleasant scent, and touchscreen support.
>
> And make them soft and plushy to the touch!
>

Or spiny and plushy, for when we switch the manpage footer from saying
"OpenBSD 6.2" to " 6.2"!