Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi Philippe,

Philippe Meunier wrote on Wed, Nov 29, 2017 at 09:11:38AM -0500:

> I've noticed something unexpected when copy-pasting UTF-8 characters in
> xterm: xterm seems to change some of the characters into something
> different but visually similar.  Here's an example (using ksh):
>
> $ uname -a
> OpenBSD foo.my.domain 6.1 GENERIC#19 i386
> $ ls
> Thérèse

That's a bad idea.  Do not use non-ASCII bytes in file names.
You are in for all kinds of trouble.

Not so much because using arbitrary bytes in file names would be
invalid, but because their meaning is completely undefined on any
UNIX-like operating system.  By definition, file names are byte
strings, not character strings.  They do NOT have a meaning in any
particular locale and are NOT representing accented characters.

In this respect, OpenBSD is better than other operating systems.
The problem is mostly hidden on OpenBSD because OpenBSD supports
UTF-8 only.  So if you use UTF-8 characters in file names, you often
get away with it simply because it's the only locale supported by
the system.  But, as you see, even on OpenBSD, you do not always
get away with such recklessness.

On other systems supporting different locales, each user can choose
their own locale, so one user may have UTF-8 set, another one
ISO-LATIN-something, and yet another one Shift-JIS.  But there is
only one file system.  So every filename will be gibberish for all
users except for the one user having a locale where it happens to
be validly encoded.

Repeat after me:  A file system does not have a locale.
Non-ASCII characters cannot be encoded in file names, on any UNIX
in general.  (Windows is different, but at the price of badly
violating POSIX in significant parts of its C library).

> $ ls | od -c
> 0000000   T   h   e 314 201   r   e 314 200   s   e  \n
> 0000014
> $ cp Thérèse Thérèse
>
> This copy command is typed as follows: type 'cp ', press tab for ksh to
> auto-complete the first filename, another space, then use the mouse to
> copy-paste the first filename into xterm to get the second filename.
> The cp command works without any error.  The result is:

   $ printf "e\xcc\x81" | uniname
   character  byte  UTF-32  encoded as  glyph  name
           0     0  000065  65          e      LATIN SMALL LETTER E
           1     1  000301  CC 81              COMBINING ACUTE ACCENT
   $ printf "\xc3\xa9" | uniname
   character  byte  UTF-32  encoded as  glyph  name
           0     0  0000E9  C3 A9       é      LATIN SMALL LETTER E WITH ACUTE

That's called "canonical composition" in Unicode.

The UTF-8 multibyte character sequences "e\xcc\x81" and "\xc3\xa9"
are canonically equivalent, which means that multibyte-character
aware software is required to treat both identically, and such
software is allowed to silently substitute one for the other.

Of course, the file system is not multibyte-character aware and
not allowed to be, so as a file name, both names are different.

Yes, you heard correctly:  Not only can filenames containing
*semantically different* Unicode characters have identical visual
representation, but the filesystem is also required to treat filenames
as different that have *identical* semantics in Unicode.

Do not use Unicode for filenames.  It simply doesn't work and
is a security nightmare on top of that.

The reason for UTF-8 support in ls(1) isn't to encourage UTF-8
filenames.  It is merely a crutch helping to display as much
information as possible about broken file systems.  They are still
broken and dangerous.

> So it looks like xterm is changing

I'm not convinced it is xterm; it might also be the X libraries
supporting copying with the mouse.  Anyway, whatever does it is
allowed to.

It's certainly not ksh(1) because our ksh is not fully multibyte-
character aware on purpose, but deliberately has only limited
multibyte-character support.  We want predictable, not surprising
behaviour in the shell.  In particular, our ksh never changes byte
sequences.

Yours,
  Ingo
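The point about canonically equivalent names being different file names can be checked directly from the shell. The following is only a minimal sketch, assuming a POSIX sh, a `mktemp` that supports `-d`, and a file system that stores names as raw bytes (as FFS and ext4 do; some other file systems normalize names themselves). The variable names are made up for the illustration, and octal escapes are used instead of `\x` because not every printf(1) accepts hexadecimal escapes.

```shell
#!/bin/sh
# Two spellings of the same visible name, written with octal escapes so
# the bytes are unambiguous regardless of terminal or editor settings.
nfd=$(printf 'The\314\201re\314\200se')   # e + U+0301, e + U+0300 (decomposed)
nfc=$(printf 'Th\303\251r\303\250se')     # U+00E9, U+00E8 (precomposed)

# Count the bytes of each spelling; the arithmetic expansion strips the
# leading blanks that some wc(1) implementations print.
nfd_len=$(( $(printf '%s' "$nfd" | wc -c) ))
nfc_len=$(( $(printf '%s' "$nfc" | wc -c) ))
echo "decomposed: $nfd_len bytes, precomposed: $nfc_len bytes"

# Hex dumps of both spellings for a visual comparison.
printf '%s' "$nfd" | od -An -tx1
printf '%s' "$nfc" | od -An -tx1

# Use both spellings as file names in a scratch directory and count the
# resulting directory entries.
dir=$(mktemp -d) || exit 1
: > "$dir/$nfd"
: > "$dir/$nfc"
count=$(( $(ls "$dir" | wc -l) ))
echo "directory entries: $count"
rm -rf "$dir"
```

On a byte-preserving file system the two names should end up as two separate entries even though a terminal may render them identically, which is why a `cp` between them succeeds instead of reporting identical files.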
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Ingo Schwarze wrote: >Philippe Meunier wrote: >> $ ls >> Thérèse > >That's a bad idea. Do not use non-ASCII bytes in file names. That's a nice thought but in practice I have some files on that machine with names written in French, Thai, Chinese, Korean, and Japanese, and for some of these files renaming is not an option for work reasons. I somehow doubt that I'm the only one in such a situation. >In this respect, OpenBSD is better than other operating systems. >The problem is mostly hidden on OpenBSD because OpenBSD supports >UTF-8 only. Yes, I've noticed that the UTF-8 support in OpenBSD has become much nicer in recent years. My thanks to the devs who did that :-) >That's called "canonical composition" in Unicode. *sigh* I see. Well, I learned something new today. Thanks for the info. >It's certainly not ksh(1) because our ksh is not fully multibyte- >character aware on purpose, but deliberately has only limited >multibyte-character support. Actually, since you brought this up, I wish ksh had fuller multibyte character support. As you say above the problem is mostly hidden and most of the time it happens to just work, but, for example, trying to delete double-wide Korean characters (well, syllables, really, which are *all* double-wide) messes up the command line: the double-wide characters are correctly deleted but the cursor moves left by only one position for each delete which means that very quickly I lose track of which characters I'm actually deleting and I'm forced to redraw the line. Anyway, at this point it's mostly anecdotal; most things work out of the box. Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi Philippe,

Philippe Meunier wrote on Wed, Nov 29, 2017 at 11:35:59AM -0500:
> Ingo Schwarze wrote:
>> Philippe Meunier wrote:

>>> $ ls
>>> Thérèse

>> That's a bad idea.  Do not use non-ASCII bytes in file names.

> That's a nice thought but in practice I have some files on that machine
> with names written in French, Thai, Chinese, Korean, and Japanese, and for
> some of these files renaming is not an option for work reasons.  I somehow
> doubt that I'm the only one in such a situation.

Sure.  In some situations, there is no viable alternative to dealing
with file systems containing broken filenames.  That's why we try
to make tools like ls(1) as useful as possible in such a bad situation.
But you can never expect a smooth user experience.  It is not an
OpenBSD-specific problem, in fact it's worse almost everywhere
else, although not everybody is likely to admit that.

>> It's certainly not ksh(1) because our ksh is not fully multibyte-
>> character aware on purpose, but deliberately has only limited
>> multibyte-character support.

> Actually, since you brought this up, I wish ksh had fuller multibyte
> character support.  As you say above the problem is mostly hidden and most
> of the time it happens to just work, but, for example, trying to delete
> double-wide Korean characters (well, syllables, really, which are *all*
> double-wide) messes up the command line:

That is indeed expected, and it is one of the things that are very
unlikely to change even in the long term.  Adding support for
correctly handling character display widths in shell command line
editing would require calling functions like mbtowc(3) and wcwidth(3)
on the fly in the command line editing modules.  Such changes would
be fairly intrusive and carry a substantial risk of introducing
nasty, perhaps even security-relevant bugs into the shell, so even
if somebody would cook up patches, i'm not convinced that they could
go in.

That said, i see that you are actually torturing our shell in these respects quite a bit. As long as you don't expect that everything can be fixed, you are quite welcome to report issues that you see. I don't doubt that there are still outright bugs, and it also seems likely that there are missing features which can be implemented without making a mess of the shell. So reports based on real everyday use are definitely helpful. While several developers understand the basics of how multibyte character support works in the shell and in some others of our POSIX utilities, very few use that heavily, as far as i know. Yours, Ingo
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Ingo Schwarze writes:
> That's a bad idea.  Do not use non-ASCII bytes in file names.
> You are in for all kinds of trouble.

I don't agree. In a situation where a single user will be accessing
files, you can use whatever naming scheme you like. UTF-8 works exactly
how you would expect: the filename you enter is the filename you'll get.
Misencoded files can also exist, with exactly the results you would
expect also: you can't necessarily type it, but if you can pass the
exact filename, programs will work. Same goes with control characters
like backspaces in file names (far more annoying than UTF-8).

Saying you can't is impractical. Anyone downloading lots of external
files through web browsers, torrent clients, or any number of other
programs in ports will eventually encounter files with UTF-8 filenames.
They work just fine. Keeping spaces out of filenames is already a lost
battle, let alone limiting them to the POSIX portable filename character
set (A-Za-z0-9._-).

Obviously once you start talking about files on external media or
otherwise accessible by users in other locales, that conclusion changes.
But I'm talking about a personal desktop here.

> > So it looks like xterm is changing
>
> I'm not convinced it is xterm; it might also be the X libraries
> supporting copying with the mouse.  Anyway, whatever does it is
> allowed to.

This is indeed xterm's fault.

     precompose (class Precompose)
             Tells xterm whether to precompose UTF-8 data into Normalization
             Form C, which combines commonly-used accents onto base
             characters.  If it does not do this, accents are left as
             separate characters.  The default is "true".

In my opinion, that's a *very* poor default. I don't expect base tools
to canonicalize text like that. UTF-8 strings work fine when passed to
grep(1), but grep doesn't -- nor would I expect it to -- canonicalize
strings, or ignore zero-width no-break spaces in running text, or any
other sort of weird transformation invented by the Unicode committee.

The only unexpected thing here is xterm doing these transformations
without asking.

-- 
Anthony J. Bentley
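For anyone who wants to turn the behaviour off locally rather than wait for a changed default, a per-user override of the resource quoted above might look like the following sketch. The file name `~/.Xdefaults` is an assumption (use whichever resource file your X session already loads, for example `~/.Xresources`); `xrdb -merge`, `xrdb -query` and xterm's `-xrm` option are the standard X tools for loading, inspecting and passing resources.

```shell
# Append the override to the per-user resource file.
# ~/.Xdefaults is an assumed location, not something the thread prescribes.
echo 'xterm*precompose: false' >> "$HOME/.Xdefaults"

# Load the file into the running X server's resource database and
# check that the setting is now present.
xrdb -merge "$HOME/.Xdefaults"
xrdb -query | grep -i precompose

# One-off alternative: pass the resource on the command line for a
# single xterm instance without touching any file.
xterm -xrm 'xterm*precompose: false' &
```

Only xterm windows started after the resource has been loaded pick up the new value; already running terminals keep the setting they were started with.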
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi Anthony,

Anthony J. Bentley wrote on Wed, Nov 29, 2017 at 10:29:28AM -0700:
> Ingo Schwarze writes:

>> That's a bad idea.  Do not use non-ASCII bytes in file names.
>> You are in for all kinds of trouble.

> I don't agree. In a situation where a single user will be accessing
> files,

That's a very strong condition, which will rarely hold.
But sure, when it does hold, and when the number of files is too
large to assign sensible file names, it partially mitigates the
problems.  But only partially.

> you can use whatever naming scheme you like. UTF-8 works exactly
> how you would expect: the filename you enter is the filename you'll
> get.

Until some program from ports decides to legitimately do Unicode
normalization, uses buggy built-in locale components, assumes the
wrong locale, or incorrectly validates character encoding and crashes
or truncates data.  Just as a few examples of what can still go
wrong even on a purely single-user system.  All these are fairly
widespread in the wild.  Quite certainly, xterm is not the only
program doing normalization, and i have rarely seen any program
that is not buggy with respect to multibyte-character handling.

> Misencoded files can also exist, with exactly the results you would
> expect also: you can't necessarily type it, but if you can pass the
> exact filename, programs will work.

Except those using fgetws(3), mbtowc(3), mbstowcs(3), and friends
for reading UTF-8 data and terminating on encoding errors, which
includes for example almost all of the FreeBSD base system, including
POSIX utilities like cut(1).

[...]
> This is indeed xterm's fault.
>
>      precompose (class Precompose)
>              Tells xterm whether to precompose UTF-8 data into Normalization
>              Form C, which combines commonly-used accents onto base
>              characters.  If it does not do this, accents are left as
>              separate characters.  The default is "true".
>
> In my opinion, that's a *very* poor default. I don't expect base tools
> to canonicalize text like that.

Base tools certainly shouldn't.  In my opinion, if Xenocara wouldn't,
that would be an improvement, too.  In particular in much-used tools
like xterm(1).  Even if that causes us to diverge a bit from upstream.

> The only unexpected thing here is xterm doing these transformations
> without asking.

I think i would support a diff to fix that near the end of

  /usr/X11R6/share/X11/app-defaults/XTerm
  == /usr/xenocara/app/xterm/XTerm.ad

Thanks for digging up the root cause of the OP's issue.

Yours,
  Ingo
Re: xterm(1) changing UTF-8 characters when copy-pasting?
On Wed, Nov 29, 2017 at 07:05:05PM +0100, Ingo Schwarze wrote: > Anthony J. Bentley wrote on Wed, Nov 29, 2017 at 10:29:28AM -0700: > > The only unexpected thing here is xterm doing these transformations > > without asking. > > I think i would support a diff to fix that Seconded. The current default behaviour is broken.
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Anthony J. Bentley wrote:
>     precompose (class Precompose)

Thanks!  That makes xterm work (almost) as expected:

$ ls
Thérèse
$ ls | od -c
0000000   T   h   e 314 201   r   e 314 200   s   e  \n
0000014
$ cp Thérèse Thérèse
cp: Thérèse and Thérèse are identical (not copied).

The first filename in the cp command above is created using ksh's
auto-completion and the second filename is created by copy-pasting the
first filename.  So xterm doesn't recompose the characters anymore.

The strange part is that, when I copy the first filename and paste it to
become the second filename, the second filename is shown without any
accent, even though the first and second filenames are now the exact same
sequence of bytes (I checked using od(1)).  So on the command line it
actually looks like this:

$ cp Thérèse Therese
cp: Thérèse and Thérèse are identical (not copied).

which looks wrong but works as expected.  I tried to play with various
things like the allowPasteControls resource but to no avail.  It looks like
an xterm bug to me but at this point I'm not even sure of that...  Anyone
has any clue?

Thanks,

Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Philippe Meunier writes: > The strange part is that, when I copy the first filename and paste > it to become the second filename, the second filename is shown without > any accent, even though the first and second filenames are now the exact > same sequence of bytes (I checked using od(1)). So on the command line > it actually looks like this: > > $ cp Thérèse Therese > cp: Thérèse and Thérèse are identical (not copied). > > which looks wrong but works as expected. I tried to play with various > things like the allowPasteControls resource but to no avail. It looks > like an xterm bug to me but at this point I'm not even sure of that... > Anyone has any clue? I get the same result, but only when using TrueType fonts (default or no). If I Ctrl-rightclick and uncheck "TrueType Fonts", the accents show up. So it looks like xterm's rendering of combining characters is broken, or unimplemented.
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Anthony J. Bentley wrote: >I get the same result, but only when using TrueType fonts (default or no). If I use TrueType fonts: $ printf "e\xcc\x81\n" only shows the letter 'e', and when I try to copy-paste it I get a letter 'e' followed by a question mark inside a circle. If I then redraw the line I get an 'e' by itself but od(1) shows that it is still e\xcc\x81. Using TrueType fonts: $ printf "\xc3\xa9\n" works fine and I can copy-paste the accented 'e' without problem. Without TrueType fonts: $ printf "e\xcc\x81\n" works fine but when I try to copy-paste the accented 'e' I get a letter 'e' followed by a question mark inside a circle. If I then redraw the line I get the correct accented 'e' again (which od(1) shows is still e\xcc\x81). Without TrueType fonts: $ printf "\xc3\xa9\n" works fine and I can copy-paste the accented 'e' without problem. So there seems to be two problems: - Copy-pasting the result of printf "e\xcc\x81\n" never works correctly in xterm, regardless of whether I use TrueType fonts or not. xterm copy-pastes the correct sequence of bytes but that sequence is not displayed correctly. That's the same problem I noticed in my previous email. - When using TrueType fonts, printf "e\xcc\x81\n" does not show the accent. On a note related to this second problem, I never use TrueType fonts in xterm anyway because then xterm can't display Thai or Chinese or Korean characters (at least with the default font; I haven't tried to use any other font). So I suspect that this second problem is more a font problem than an xterm bug. Here's my current config: $ xrdb -query xterm*background: black xterm*foreground: white xterm*metaSendsEscape: true xterm*multiScroll: true xterm*precompose: false xterm*saveLines:256 xterm*scrollBar:true xterm*scrollKey:true xterm*scrollTtyOutput: false xterm*utf8Title:true xterm*utmpInhibit: true xterm*visualBell: true and: $ set | egrep -i utf LC_CTYPE=en_US.UTF-8 XTERM_LOCALE=en_US.UTF-8 Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Philippe Meunier writes: > So there seems to be two problems: > > - Copy-pasting the result of printf "e\xcc\x81\n" never works correctly in > xterm, regardless of whether I use TrueType fonts or not. xterm > copy-pastes the correct sequence of bytes but that sequence is not > displayed correctly. That's the same problem I noticed in my previous > email. > > - When using TrueType fonts, printf "e\xcc\x81\n" does not show the accent. Are you using xterm(1) or uxterm(1)? When I start uxterm I don't see these behaviors. I see the correct accented e in all cases. Allan
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Allan Streib wrote: >Are you using xterm(1) or uxterm(1)? uxterm does not exist anymore on OpenBSD 6.1: https://www.openbsd.org/faq/upgrade61.html Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Philippe Meunier writes: > Allan Streib wrote: >>Are you using xterm(1) or uxterm(1)? > > uxterm does not exist anymore on OpenBSD 6.1: > https://www.openbsd.org/faq/upgrade61.html Hm. Well that's one that I overlooked. I've been upgrading since 5.x and I never removed uxterm. I'm on 6.2 now and still using it. Allan
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi,

Allan Streib wrote on Thu, Nov 30, 2017 at 12:09:13PM -0500:
> Philippe Meunier writes:
>> Allan Streib wrote:

>>> Are you using xterm(1) or uxterm(1)?

>> uxterm does not exist anymore on OpenBSD 6.1:
>> https://www.openbsd.org/faq/upgrade61.html

> Hm. Well that's one that I overlooked. I've been upgrading since 5.x
> and I never removed uxterm. I'm on 6.2 now and still using it.

It's a trivial but wordy wrapper script.  The only things it does
that i could imagine to be relevant are setting two command line
options: -class UXTerm and -en UTF-8.

The -en option is a deprecated way to hardcode UTF-8 mode for systems
that do not support setlocale(3), so don't use it.  It can't be
what helps you here, as UTF-8 works in general.

The -class UXTerm option causes /usr/X11R6/share/X11/app-defaults/UXTerm
to be used instead of /usr/X11R6/share/X11/app-defaults/XTerm.
The UXTerm file was also deleted, as it contains only font stuff
and nobody considered that relevant for anything.

Does the following make things work better for you?
You can apply it directly to /usr/X11R6/share/X11/app-defaults/XTerm
if you want to.  It just copies the UXTerm.ad stuff over and disables
the Precompose resource.

Frankly, i don't have the slightest idea what the font resources
mean, not even after reading the comment in UXTerm.ad, but maybe
they are needed for some reason.

Except in a professional typesetting system like groff or LaTeX, i
consider anything that makes the end user worry about fonts
fundamentally broken.  Fonts that work should be installed by default
and not configurable, in my opinion.  Toying around with fonts
causes nothing but grief.

Yours,
  Ingo


Index: XTerm.ad
===================================================================
RCS file: /cvs/xenocara/app/xterm/XTerm.ad,v
retrieving revision 1.18
diff -u -p -r1.18 XTerm.ad
--- XTerm.ad	15 Jul 2017 19:20:51 -0000	1.18
+++ XTerm.ad	30 Nov 2017 17:52:26 -0000
@@ -266,6 +266,14 @@
 ! locales.  Even for people using the C/POSIX locale for everything,
 ! that's safer and more usable than the upstream default of "medium".
 *locale: UTF-8
+*precompose: false
+*VT100.utf8: 1
+*VT100.font2: -misc-fixed-medium-r-normal--8-80-75-75-c-50-iso10646-1
+*VT100.font: -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
+*VT100.font3: -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
+*VT100.font4: -misc-fixed-medium-r-normal--13-120-75-75-c-80-iso10646-1
+*VT100.font5: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
+*VT100.font6: -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1
 
 ! ScrollBar by default
 *scrollBar: true
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi Ingo,

Ingo Schwarze writes:
> Except in a professional typesetting system like groff or LaTeX, i
> consider anything that makes the end user worry about fonts
> fundamentally broken.

I think everybody's in agreement that xterm is broken and wrong here.

> Fonts that work should be installed by default
> and not configurable, in my opinion.  Toying around with fonts
> causes nothing but grief.

You'll need extra fonts once I finish my patch to add situationally
appropriate emoji to all our manpages.

> +*precompose: false

Sure.

> +*VT100.utf8: 1

xterm(1):
             This option and the utf8 resource are overridden by the -lc and
             -en options and locale resource.

We set the locale resource, so this appears redundant.

> +*VT100.font2: -misc-fixed-medium-r-normal--8-80-75-75-c-50-iso10646-1
> +*VT100.font: -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
> +*VT100.font3: -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
> +*VT100.font4: -misc-fixed-medium-r-normal--13-120-75-75-c-80-iso10646-1
> +*VT100.font5: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
> +*VT100.font6: -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

These are already the default according to appres(1).

-- 
Anthony J. Bentley
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi Anthony,

Anthony J. Bentley wrote on Thu, Nov 30, 2017 at 11:28:54PM -0700:

> You'll need extra fonts once I finish my patch to add situationally
> appropriate emoji to all our manpages.

I'm looking forward to that.  Don't forget to make them animated,
make the colours fully configurable, and maybe add some nice
background music, a pleasant scent, and touchscreen support.

>> +*precompose: false

> Sure.

On a more serious note, i'll commit that tomorrow then
based on OK bentley@ unless somebody can point out a downside.

>> +*VT100.utf8: 1

> xterm(1):
>              This option and the utf8 resource are overridden by the -lc and
>              -en options and locale resource.
>
> We set the locale resource, so this appears redundant.

Sounds convincing, so we don't need that,
even though it used to be in UXTerm.ad.

>> +*VT100.font2: -misc-fixed-medium-r-normal--8-80-75-75-c-50-iso10646-1
>> +*VT100.font: -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
>> +*VT100.font3: -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
>> +*VT100.font4: -misc-fixed-medium-r-normal--13-120-75-75-c-80-iso10646-1
>> +*VT100.font5: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
>> +*VT100.font6: -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

> These are already the default according to appres(1).

Hum, i don't doubt your analysis.  But now i don't understand why
uxterm(1) works for Allan and plain xterm(1) doesn't...

I mean, what else is there in the old uxterm script that could
possibly make a difference?

Yours,
  Ingo
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Ingo Schwarze wrote:
>Hum, i don't doubt your analysis.  But now i don't understand why
>uxterm(1) works for Allan and plain xterm(1) doesn't...

Re-reading Allan's email, it's not clear to me whether he did his tests
with the precompose resource set to true or false.  If using the default
value of true then:

- Copy-pasting the result of printf "e\xcc\x81\n" works correctly in xterm,
  regardless of whether I use TrueType fonts or not.  That's because, as
  pointed out by Ingo, xterm rewrites e\xcc\x81 into \xc3\xa9.  That's the
  reason why this whole discussion started (and preventing the rewrite is
  then the reason why setting the precompose resource to false makes
  sense).

- When using TrueType fonts, printf "e\xcc\x81\n" shows the accent.  This
  is with the precompose resource set to its default true value.
  Interestingly, when the precompose resource is set to false and TrueType
  fonts are used, the same printf "e\xcc\x81\n" does not show the accent
  (as indicated in one of my previous emails).  So it looks like this
  is not just a font problem after all but another bug (which Anthony
  actually already pointed out in his second email).

So my conclusions so far are:

- Allan probably did his tests with the precompose resource set to its
  default true value.  It's either that or there is some as yet unknown
  extra factor that makes a difference in the results between him and me.

- When the precompose resource is set to false, copy-pasting the result of
  printf "e\xcc\x81\n" never works correctly in xterm, regardless of
  whether I use TrueType fonts or not.  xterm copy-pastes the correct
  sequence of bytes but that sequence is not displayed correctly.  That's a
  bug in xterm.

- In addition, when the precompose resource is set to false and TrueType
  fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even
  before trying to copy-paste it): od(1) shows that the correct sequence of
  bytes is printed but it is displayed without accent.  That's another bug
  in xterm.  The result is displayed correctly when the precompose resource
  is set to true.

Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Ingo Schwarze writes: > >> +*precompose: false > > > Sure. > > On a more serious note, i'll commit that tomorrow then > based on OK bentley@ unless somebody can point out a downside. Please update the OPENBSD SPECIFICS section of the manual as well. > Hum, i don't doubt your analysis. But now i don't understand why > uxterm(1) works for Allan and plain xterm(1) doesn't... Yeah, my guess is he never disabled precomposition for uxterm, meaning what he's seeing are not actually combining characters, meaning xterm doesn't bug out.
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Philippe Meunier writes:

> - Allan probably did his tests with the precompose resource set to its
>   default true value.

I assume this is correct because I have never deliberately changed
it. And you're right after all.

$ printf "e\xcc\x81\n" | od -a
0000000   e  cc  81  nl

$ printf "e\xcc\x81\n"
é

^ copy/pasting:

$ echo "é" | od -a
0000000  c3  a9  nl

Allan
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Philippe Meunier writes: > - When the precompose resource is set to false, copy-pasting the result of > printf "e\xcc\x81\n" never works correctly in xterm, regardless of > whether I use TrueType fonts or not. xterm copy-pastes the correct > sequence of bytes but that sequence is not displayed correctly. That's a > bug in xterm. I get slightly different results: with TrueType fonts enabled, LC_CTYPE set to en_US.UTF-8, and precompose disabled, accents are not displayed, but they do copy and paste correctly. I tested this on a fresh install as well as my desktop. I haven't been able to trigger the results you're getting (best guess: your LC_CTYPE is unset or set funny? But I don't get the same results even then). > - In addition, when the precompose resource is set to false and TrueType > fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even > before trying to copy-paste it): od(1) shows that the correct sequence of > bytes is printed but it is displayed without accent. That's another bug > in xterm. The result is displayed correctly when the precompose resource > is set to true. Yes, this matches what I'm seeing.
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Hi, Anthony J. Bentley wrote on Fri, Dec 01, 2017 at 08:18:59AM -0700: > Philippe Meunier writes: >> - In addition, when the precompose resource is set to false and TrueType >> fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even >> before trying to copy-paste it): od(1) shows that the correct sequence of >> bytes is printed but it is displayed without accent. That's another bug >> in xterm. The result is displayed correctly when the precompose resource >> is set to true. > Yes, this matches what I'm seeing. To me, that seems to imply that xterm(1), with the bugs it has now, works significantly better with Precompose enabled: at least it displays the correct glyphs, while there seem to be cases where it displays wrong glyphs without Precompose. Right? Doesn't that imply that it would be better to fix this bug first, before disabling Precompose? I certainly hate that xterm(1) is doing normalization by default now, but if removing that exposes a bug that causes display of incorrect glyphs, that would seem like a serious regression to me. What do you think? Ingo
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Ingo Schwarze writes:
> Hi,
>
> Anthony J. Bentley wrote on Fri, Dec 01, 2017 at 08:18:59AM -0700:
> > Philippe Meunier writes:
>
> >> - In addition, when the precompose resource is set to false and TrueType
> >>   fonts are used, the result of printf "e\xcc\x81\n" itself is wrong (even
> >>   before trying to copy-paste it): od(1) shows that the correct sequence of
> >>   bytes is printed but it is displayed without accent.  That's another bug
> >>   in xterm.  The result is displayed correctly when the precompose resource
> >>   is set to true.
>
> > Yes, this matches what I'm seeing.
>
> To me, that seems to imply that xterm(1), with the bugs it has now,
> works significantly better with Precompose enabled: at least it
> displays the correct glyphs, while there seem to be cases where it
> displays wrong glyphs without Precompose.  Right?
>
> Doesn't that imply that it would be better to fix this bug first,
> before disabling Precompose?  I certainly hate that xterm(1) is
> doing normalization by default now, but if removing that exposes a
> bug that causes display of incorrect glyphs, that would seem like
> a serious regression to me.
>
> What do you think?

I was internally debating this earlier. The bug is already exposed by
any combining characters that don't have precomposed forms. It also
doesn't show up with the default (i.e. non TrueType) fonts. Given that
and how unfriendly the precomposition behavior is, I think disabling it
is still reasonable.
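The remark about combining characters without precomposed forms suggests a simple way to exercise the rendering path regardless of the precompose setting. The sketch below is an illustration only: the choice of "x" with a combining acute accent is mine, not from the thread (Unicode defines a precomposed code point for e with acute, U+00E9, but none for x with acute), and the variable names are made up.

```shell
#!/bin/sh
# 'e' + U+0301 can be folded into the single code point U+00E9 by a
# normalizer; 'x' + U+0301 has no precomposed code point, so it has to
# stay a base letter followed by a combining character.
with_precomposed=$(printf 'e\314\201')
without_precomposed=$(printf 'x\314\201')

# Print both next to each other; whether the accents are drawn depends
# on the terminal and font in use.
printf '%s %s\n' "$with_precomposed" "$without_precomposed"

# The second sequence is three bytes: the letter plus a two-byte
# combining accent.
x_len=$(( $(printf '%s' "$without_precomposed" | wc -c) ))
echo "x + combining acute: $x_len bytes"
printf '%s' "$without_precomposed" | od -An -tx1
```

If the accent over the "x" is missing in a given terminal while the one over the "e" is shown, that would be consistent with the display problem described above being independent of precomposition.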
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Allan Streib writes:

> $ printf "e\xcc\x81\n" | od -a
> 0000000   e  cc  81  nl
>
> $ printf "e\xcc\x81\n"
> é
>
> ^ copy/pasting:
>
> $ echo "é" | od -a
> 0000000  c3  a9  nl

Also in case it's interesting:

$ printf "e\xcc\x81" | xclip -i
$ xclip -o | od -a
0000000   e  cc  81

$ echo "é" | od -a
0000000   e  cc  81  nl

In the above, the "é" was obtained with middle-click (paste).

$ echo "é" | od -a
0000000  c3  a9  nl

In the above, the entire command 'echo "é" | od -a' was copied from the
prior line and pasted with the mouse.

Allan
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Anthony J. Bentley wrote: >Philippe Meunier writes: >> - When the precompose resource is set to false, copy-pasting the result of >> printf "e\xcc\x81\n" never works correctly in xterm, regardless of >> whether I use TrueType fonts or not. xterm copy-pastes the correct >> sequence of bytes but that sequence is not displayed correctly. That's a >> bug in xterm. > >I get slightly different results: with TrueType fonts enabled, LC_CTYPE >set to en_US.UTF-8, and precompose disabled, accents are not displayed, >but they do copy and paste correctly. I tested this on a fresh install as >well as my desktop. I haven't been able to trigger the results you're >getting (best guess: your LC_CTYPE is unset or set funny? But I don't get >the same results even then). Strange. I have: $ set | egrep -i 'utf|xterm' LC_CTYPE=en_US.UTF-8 TERM=xterm XTERM_LOCALE=en_US.UTF-8 XTERM_SHELL=/bin/ksh XTERM_VERSION='XTerm/OpenBSD(327)' and even with just this: $ xrdb -query xterm*precompose: false and TrueType enabled, then accents are not displayed and copy-paste does not work: I get an 'e' character followed by another character which is a question mark inside a circle. Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
Anthony J. Bentley wrote: >I was internally debating this earlier. The bug is already exposed by >any combining characters that don't have precomposed forms. It also >doesn't show up with the default (i.e. non TrueType) fonts. Given that >and how unfriendly the precomposition behavior is, I think disabling it >is still reasonable. I'd agree with that. TrueType fonts are not the default. I think it's more important to get copy-paste to work the way one would expect it to work (even if it displays the characters the wrong way). Philippe
Re: xterm(1) changing UTF-8 characters when copy-pasting?
On Fri, Dec 01, 2017 at 12:14:48PM +0100, Ingo Schwarze wrote: > Hi Anthony, > > Anthony J. Bentley wrote on Thu, Nov 30, 2017 at 11:28:54PM -0700: > > > You'll need extra fonts once I finish my patch to add situationally > > appropriate emoji to all our manpages. > > I'm looking forward to that. Don't forget to make them animated, > make the colours fully configurable, and maybe add some nice > background music, a pleasant scent, and touchscreen support. And make them soft and plushy to the touch!
Re: xterm(1) changing UTF-8 characters when copy-pasting?
On Fri, Dec 1, 2017 at 11:38 AM, Stefan Sperling wrote: > On Fri, Dec 01, 2017 at 12:14:48PM +0100, Ingo Schwarze wrote: > > Anthony J. Bentley wrote on Thu, Nov 30, 2017 at 11:28:54PM -0700: > > > > > You'll need extra fonts once I finish my patch to add situationally > > > appropriate emoji to all our manpages. > > > > I'm looking forward to that. Don't forget to make them animated, > > make the colours fully configurable, and maybe add some nice > > background music, a pleasant scent, and touchscreen support. > > And make them soft and plushy to the touch! > Or spiney and plushy, for when we switch the manpage footer from saying "OpenBSD 6.2" to " 6.2"!