Re: wcwidth of soft hyphen
On Thu, 2021-04-15 at 14:20 +0200, Martijn van Duren wrote: > I did some archeology today and found that it used to behave as > non-printable, but it got broken in release 334 (august 2018), when > CharWidth was introduced. Before that my_wcwidth was used directly. > > Since there doesn't appear to be a repository with commit messages I'm > not 100% sure why this macro was introduced. My best guess at this > point would be the following line from the xterm.log.html: > several minor performance improvements using macros, e.g., inline checks > for character width. > Which would imply that it is indeed a bug in xterm. > I mailed Thomas Dickey to ask his view on the situation and maybe get > some context. Answer pending. Answer received: It's as I described it and fixing it is on his todo list. > > On Wed, 2021-04-14 at 21:25 +0200, Martijn van Duren wrote: > > On Wed, 2021-04-14 at 20:10 +0300, Lauri Tirkkonen wrote: > > > Since the discussion seems to have died out, I take my patch will not be > > > accepted. > > > > > > The decision appears to be that OpenBSD is right and everyone else is > > > wrong in > > > this matter. Given that, and the calls to change the behavior of other > > > OSes and > > > terminal emulators around SHY: are you going to at least patch xterm > > > in-tree so > > > that it does not render SHY? > > > > > > Or must it remain broken? > > > > > Looking closer at the xterm source corroborated my previous reasoning. > > From xterm's wcwidth.c: > > /* > > * Provide a way to change the behavior of soft-hyphen. > > */ > > void mk_wcwidth_init(int mode) > > { > > use_latin1 = (mode == 0); > > } > > > > and > > > > * - SOFT HYPHEN (U+00AD) has a column width of 1 in Latin-1, 0 in > > Unicode. > > * An initialization function is used to switch between the two. > > > > So it is the intention of xterm to not display the soft hyphen in > > unicode mode. 
> > > > This is also corrobarated by charproc.c5799 where the error occurs: > > if (ch == 0xad) { > > /* > > > > * Only display soft-hyphen if it happens to be > > at > > * the right-margin. While that means that only > > * the displayed character could be selected for > > * pasting, a well-behaved application would > > never > > * send this, anyway... > > */ > > > > The problem here is that on line 5795 we have: > > last_chomp = CharWidth(buf[n]); > > which expands to: > > CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) > > (n))) > > and > > #define IsLatin1(n) (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= > > 255)) > > > > So here's the big oops: CharWidth doesn't know we're in UTF-8 mode and > > we never reach my_wcwidth. > > > > Diff below fixes this behaviour for me and restores the printing > > behaviour when I run xterm with +u8 to reset utf-8 mode. > > However, I'm no xterm hacker and it's quite a beast, so this needs > > proper testing and scrutiny from someone who knows the code to make > > sure there's no use of uninitialized variables. (CC matthieu@) > > > > No intention of pushing this for 6.9, but maybe someone brave is > > willing to dive in here after me. 
> > > > martijn@ > > > > Index: charproc.c > > === > > RCS file: /cvs/xenocara/app/xterm/charproc.c,v > > retrieving revision 1.49 > > diff -u -p -r1.49 charproc.c > > --- charproc.c 2 Apr 2021 18:44:19 - 1.49 > > +++ charproc.c 14 Apr 2021 19:24:14 - > > @@ -2305,7 +2305,7 @@ doparsing(XtermWidget xw, unsigned c, st > > */ > > if (c >= 0x300 > > && screen->wide_chars > > - && CharWidth(c) == 0 > > + && CharWidth(screen, c) == 0 > > && !isWideControl(c)) { > > int prev, test; > > Boolean used = True; > > @@ -2330,9 +2330,9 @@ doparsing(XtermWidget xw, unsigned c, st > > prev = (int) XTERM_CELL(use_row, use_col); > > test = do_precomposition(prev, (int) c); > > TRACE(("do_precomposition (U+%04X [%d], U+%04X [%d]) -> > > U+%04X [%d]\n", > > - prev, CharWidth(prev), > > - (int) c, CharWidth(c), > > - test, CharWidth(test))); > > + prev, CharWidth(screen, prev), > > + (int) c, CharWidth(screen, c), > > + test, CharWidth(screen, test))); > > } else { > > prev = -1; > > test = -1; > > @@ -2342,7 +2342,7 @@ doparsing(XtermWidget xw, unsigned c, st > > * only if it does not change the width o
Re: wcwidth of soft hyphen
I did some archeology today and found that it used to behave as non-printable, but it got broken in release 334 (august 2018), when CharWidth was introduced. Before that my_wcwidth was used directly. Since there doesn't appear to be a repository with commit messages I'm not 100% sure why this macro was introduced. My best guess at this point would be the following line from the xterm.log.html: several minor performance improvements using macros, e.g., inline checks for character width. Which would imply that it is indeed a bug in xterm. I mailed Thomas Dickey to ask his view on the situation and maybe get some context. Answer pending. On Wed, 2021-04-14 at 21:25 +0200, Martijn van Duren wrote: > On Wed, 2021-04-14 at 20:10 +0300, Lauri Tirkkonen wrote: > > Since the discussion seems to have died out, I take my patch will not be > > accepted. > > > > The decision appears to be that OpenBSD is right and everyone else is wrong > > in > > this matter. Given that, and the calls to change the behavior of other OSes > > and > > terminal emulators around SHY: are you going to at least patch xterm > > in-tree so > > that it does not render SHY? > > > > Or must it remain broken? > > > Looking closer at the xterm source corroborated my previous reasoning. > From xterm's wcwidth.c: > /* > * Provide a way to change the behavior of soft-hyphen. > */ > void mk_wcwidth_init(int mode) > { > use_latin1 = (mode == 0); > } > > and > > * - SOFT HYPHEN (U+00AD) has a column width of 1 in Latin-1, 0 in Unicode. > * An initialization function is used to switch between the two. > > So it is the intention of xterm to not display the soft hyphen in > unicode mode. > > This is also corrobarated by charproc.c5799 where the error occurs: > if (ch == 0xad) { > /* > > * Only display soft-hyphen if it happens to be at > * the right-margin. While that means that only > * the displayed character could be selected for > * pasting, a well-behaved application would never > * send this, anyway... 
> */ > > The problem here is that on line 5795 we have: > last_chomp = CharWidth(buf[n]); > which expands to: > CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) (n))) > and > #define IsLatin1(n) (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= 255)) > > So here's the big oops: CharWidth doesn't know we're in UTF-8 mode and > we never reach my_wcwidth. > > Diff below fixes this behaviour for me and restores the printing > behaviour when I run xterm with +u8 to reset utf-8 mode. > However, I'm no xterm hacker and it's quite a beast, so this needs > proper testing and scrutiny from someone who knows the code to make > sure there's no use of uninitialized variables. (CC matthieu@) > > No intention of pushing this for 6.9, but maybe someone brave is > willing to dive in here after me. > > martijn@ > > Index: charproc.c > === > RCS file: /cvs/xenocara/app/xterm/charproc.c,v > retrieving revision 1.49 > diff -u -p -r1.49 charproc.c > --- charproc.c 2 Apr 2021 18:44:19 - 1.49 > +++ charproc.c 14 Apr 2021 19:24:14 - > @@ -2305,7 +2305,7 @@ doparsing(XtermWidget xw, unsigned c, st > */ > if (c >= 0x300 > && screen->wide_chars > - && CharWidth(c) == 0 > + && CharWidth(screen, c) == 0 > && !isWideControl(c)) { > int prev, test; > Boolean used = True; > @@ -2330,9 +2330,9 @@ doparsing(XtermWidget xw, unsigned c, st > prev = (int) XTERM_CELL(use_row, use_col); > test = do_precomposition(prev, (int) c); > TRACE(("do_precomposition (U+%04X [%d], U+%04X [%d]) -> > U+%04X [%d]\n", > - prev, CharWidth(prev), > - (int) c, CharWidth(c), > - test, CharWidth(test))); > + prev, CharWidth(screen, prev), > + (int) c, CharWidth(screen, c), > + test, CharWidth(screen, test))); > } else { > prev = -1; > test = -1; > @@ -2342,7 +2342,7 @@ doparsing(XtermWidget xw, unsigned c, st > * only if it does not change the width of the base character > */ > if (test != -1 > - && CharWidth(test) == CharWidth(prev)) { > + && CharWidth(screen, test) == CharWidth(screen, prev)) { > 
putXtermCell(screen, use_row, use_col, test); > } else if (screen->char_was_written > || getXtermCell(screen, use_row, use_col
Re: wcwidth of soft hyphen
Since the discussion seems to have died out, I take it my patch will not be
accepted.

The decision appears to be that OpenBSD is right and everyone else is wrong in
this matter. Given that, and the calls to change the behavior of other OSes and
terminal emulators around SHY: are you going to at least patch xterm in-tree so
that it does not render SHY?

Or must it remain broken?

-- 
Lauri Tirkkonen | lotheac @ IRCnet
Re: wcwidth of soft hyphen
On Wed, 2021-04-14 at 20:10 +0300, Lauri Tirkkonen wrote:
> Since the discussion seems to have died out, I take my patch will not be
> accepted.
>
> The decision appears to be that OpenBSD is right and everyone else is wrong
> in this matter. Given that, and the calls to change the behavior of other
> OSes and terminal emulators around SHY: are you going to at least patch
> xterm in-tree so that it does not render SHY?
>
> Or must it remain broken?

Looking closer at the xterm source corroborated my previous reasoning.
From xterm's wcwidth.c:

    /*
     * Provide a way to change the behavior of soft-hyphen.
     */
    void mk_wcwidth_init(int mode)
    {
        use_latin1 = (mode == 0);
    }

and

     *  - SOFT HYPHEN (U+00AD) has a column width of 1 in Latin-1, 0 in Unicode.
     *    An initialization function is used to switch between the two.

So it is the intention of xterm not to display the soft hyphen in Unicode
mode.

This is also corroborated by charproc.c:5799, where the error occurs:

    if (ch == 0xad) {
        /*
         * Only display soft-hyphen if it happens to be at
         * the right-margin. While that means that only
         * the displayed character could be selected for
         * pasting, a well-behaved application would never
         * send this, anyway...
         */

The problem here is that on line 5795 we have:

    last_chomp = CharWidth(buf[n]);

which expands to:

    CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) (n)))

and

    #define IsLatin1(n) (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= 255))

So here's the big oops: CharWidth doesn't know we're in UTF-8 mode, and for
code points below 256 we never reach my_wcwidth.

The diff below fixes this behaviour for me and restores the printing
behaviour when I run xterm with +u8 to reset UTF-8 mode.
However, I'm no xterm hacker and it's quite a beast, so this needs
proper testing and scrutiny from someone who knows the code to make
sure there's no use of uninitialized variables. (CC matthieu@)

No intention of pushing this for 6.9, but maybe someone brave is
willing to dive in here after me.
martijn@ Index: charproc.c === RCS file: /cvs/xenocara/app/xterm/charproc.c,v retrieving revision 1.49 diff -u -p -r1.49 charproc.c --- charproc.c 2 Apr 2021 18:44:19 - 1.49 +++ charproc.c 14 Apr 2021 19:24:14 - @@ -2305,7 +2305,7 @@ doparsing(XtermWidget xw, unsigned c, st */ if (c >= 0x300 && screen->wide_chars - && CharWidth(c) == 0 + && CharWidth(screen, c) == 0 && !isWideControl(c)) { int prev, test; Boolean used = True; @@ -2330,9 +2330,9 @@ doparsing(XtermWidget xw, unsigned c, st prev = (int) XTERM_CELL(use_row, use_col); test = do_precomposition(prev, (int) c); TRACE(("do_precomposition (U+%04X [%d], U+%04X [%d]) -> U+%04X [%d]\n", - prev, CharWidth(prev), - (int) c, CharWidth(c), - test, CharWidth(test))); + prev, CharWidth(screen, prev), + (int) c, CharWidth(screen, c), + test, CharWidth(screen, test))); } else { prev = -1; test = -1; @@ -2342,7 +2342,7 @@ doparsing(XtermWidget xw, unsigned c, st * only if it does not change the width of the base character */ if (test != -1 - && CharWidth(test) == CharWidth(prev)) { + && CharWidth(screen, test) == CharWidth(screen, prev)) { putXtermCell(screen, use_row, use_col, test); } else if (screen->char_was_written || getXtermCell(screen, use_row, use_col) >= ' ') { @@ -4551,7 +4551,7 @@ doparsing(XtermWidget xw, unsigned c, st value = zero_if_default(0); TRACE(("CASE_DECFRA - Fill rectangular area\n")); - if (nparam > 0 && CharWidth(value) > 0) { + if (nparam > 0 && CharWidth(screen, value) > 0) { xtermParseRect(xw, ParamPair(1), &myRect); ScrnFillRectangle(xw, &myRect, value, xw->flags, True); } @@ -4860,7 +4860,7 @@ doparsing(XtermWidget xw, unsigned c, st case CASE_REP: TRACE(("CASE_REP\n")); - if (CharWidth(sp->lastchar) > 0) { + if (CharWidth(screen, sp->lastchar) > 0) { IChar repeated[2]; count = one_if_default(0); repeated[0] = (IChar) sp->lastchar; @@ -5792,7 +5792,7 @@ dotext(XtermWidget xw, buf[n] <= 0xa0) {
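For readers who want to poke at the macro problem outside of xterm, here is a minimal standalone sketch. The two macros are copied from the message; my_wcwidth() is only a stand-in stub, and width_mode_aware() with its utf8_mode flag is a hypothetical illustration of a mode-aware check. It is not the code from the diff, which passes `screen` to CharWidth and whose new macro body is not shown in this thread.

```c
#include <assert.h>
#include <wchar.h>

/* Both macros are copied from the message above. */
#define IsLatin1(n) (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= 255))
#define CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) (n)))

/*
 * Stand-in for xterm's my_wcwidth() (assumption): C0/C1 controls and
 * SOFT HYPHEN are width 0, everything else is width 1.
 */
int my_wcwidth(wchar_t wc)
{
    if (wc < 0x20 || (wc >= 0x7f && wc < 0xa0) || wc == 0xad)
        return 0;
    return 1;
}

/* The macro as quoted: the terminal mode is not consulted at all. */
int width_as_quoted(unsigned ch)
{
    return CharWidth(ch);
}

/*
 * Hypothetical mode-aware variant. utf8_mode is an illustrative flag,
 * not an xterm field.
 */
int width_mode_aware(int utf8_mode, unsigned ch)
{
    if (!utf8_mode && ch < 256)
        return IsLatin1(ch) ? 1 : 0;
    return my_wcwidth((wchar_t) ch);
}

/* Worked example for SHY (0xad). */
void shy_width_demo(void)
{
    assert(width_as_quoted(0xad) == 1);     /* Latin-1 shortcut wins */
    assert(width_mode_aware(1, 0xad) == 0); /* UTF-8 mode: my_wcwidth() */
    assert(width_mode_aware(0, 0xad) == 1); /* Latin-1 mode: printed */
}
```

The point of the sketch is only that 0xad falls inside the 160..255 range of IsLatin1, so the quoted macro answers 1 before any Unicode-aware width function is asked.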
Re: wcwidth of soft hyphen
On Tue, Apr 06 2021 11:27:21 +0100, Stuart Henderson wrote:
> Some terminal emulators are using iso-8859-1 semantics of soft hyphen,
> unicode did things differently but those terminals haven't changed.
>
> xterm            printed as hyphen
> mlterm           printed as hyphen
> putty            printed as hyphen
> urxvt            overprinted on previous character
> st               not printed, no space
> kitty            not printed, no space
> cool-retro-term  not printed, no space
> sakura           printed as space

st actually relies on wcwidth(), so on Debian (for example) it prints the SHY
as a hyphen.

> Pragmatically the simplest fix for the original problem might be if
> irssi filtered out soft-hyphen characters like mutt does in its
> "is_display_corrupting_utf8()" function:
>
> https://gitlab.com/muttmua/mutt/-/blob/master/mbyte.c#L528

Thanks, it's news to me that mutt does that. It says something when an
application explicitly hardcodes code points not to print. I don't
particularly like the 'solution' of every TUI application having to ship its
own fixes for stuff like this, though.

-- 
Lauri Tirkkonen | lotheac @ IRCnet
Re: wcwidth of soft hyphen
On Tue, Apr 06 2021 13:09:11 +0200, Martijn van Duren wrote: > On Thu, 2021-04-01 at 10:39 +0300, Lauri Tirkkonen wrote: > > On Thu, Apr 01 2021 09:30:36 +0200, Martijn van Duren wrote: > > > However, based on the description by the Unicode Consortium I think > > > OpenBSD does the right thing and xterm and others should be fixed, > > > > practically, I doubt this will happen. I don't think the glibc people will > > be > > convinced to break compatibility to their older versions, for example. I > > explicitly mentioned I don't wish to engage in a discussion about which way > > is > > _correct_ - I am interested in interoperability with real, existing systems. > > > I´m not convinced that you´ve shown that it´s actually an > interoperability issue. In your last mail you state that it´s a simple > display difference between tmux and raw xterm on OpenBSD. To me that´s > similar to most linux distro´s having grep being an alias for > grep --color=auto by default and stating that we should do the same > because you like pretty colours. What applications fail to operate or > operate in an severely erroneous way because of this discrepency? I'll try again to describe the problem, and show an example. TUI applications often care, for layout purposes, how long a particular string or line will be on the output device. A not insignificant number of those applications use wcwidth() to figure out how much column space will be taken by a certain character or string. If the application performing the width calculations is running on a different machine than the terminal, say, through ssh, it is important that the application's idea of width matches what the terminal will eventually render; if it doesn't, then the application could print the string over some other TUI element, for example. This is difficult and messy for many reasons already discussed, especially when different operating systems disagree about the width or printability of a character. 
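The layout calculation described above can be sketched roughly as follows. To keep the result independent of the local libc and locale, the sketch does not call wcwidth(3) itself; width_shy0() and width_shy1() are hypothetical stand-ins for two implementations that disagree only about SHY, and both simplify every other character to one column.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define SHY 0x00ad /* SOFT HYPHEN */

typedef int (*width_fn)(wchar_t);

/* Hypothetical policy A: SHY occupies no column. */
int width_shy0(wchar_t wc)
{
    return (wc == SHY) ? 0 : 1;
}

/* Hypothetical policy B: SHY occupies one column like any other character. */
int width_shy1(wchar_t wc)
{
    (void) wc;
    return 1;
}

/* Sum the columns of a wide string, as a TUI might do before drawing it. */
int display_width(const wchar_t *s, width_fn width)
{
    int total = 0;

    assert(s != NULL && width != NULL);
    for (; *s != L'\0'; s++) {
        int w = width(*s);

        if (w > 0) /* skip non-printing results (0 or -1) */
            total += w;
    }
    return total;
}
```

For the five-character string "co", SHY, "op", policy A yields 4 columns and policy B yields 5. If the application uses one policy and the terminal the other, every SHY shifts the rest of the line by one column relative to what the application planned.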
Nevertheless, in 2021, wcwidth implementations mostly agree, and even things
like emojis get a wcwidth of 2 everywhere I've observed (in contrast to some
-1 wcwidths of printable characters I observed on other OSes in the past). But
SHY still seems to be something that causes issues in terminals, at least for
me.

As the example, I ran the command "/exec cat longshy.txt /etc/motd" inside
irssi, in an 80x24 terminal window, with a few different
terminal/OS-running-terminal/OS-running-irssi configurations. 'longshy.txt' is
available at https://hacktheplanet.fi/shy/longshy.txt

Let's start with st(1), since it's simple and uses wcwidth() directly to
decide how wide a character should be printed:

st on OpenBSD, local irssi
https://hacktheplanet.fi/shy/st-openbsd-local.png
st on Debian, local irssi
https://hacktheplanet.fi/shy/st-debian-local.png

Here we can see two key things:
1) on Debian, st is rendering the SHY characters - on OpenBSD it is not
2) on Debian, irssi considers the line long enough that it splits it and
   prints the remainder on the next line, indented

So, let's introduce ssh into the mix:

st on OpenBSD, irssi on Debian
https://hacktheplanet.fi/shy/st-openbsd-ssh-debian.png
st on Debian, irssi on OpenBSD
https://hacktheplanet.fi/shy/st-debian-ssh-openbsd.png

We begin to see differences that stem from wcwidth(SHY). These problems
aren't very big, since in both cases the output is still readable and no
information is lost.

Now, let's try xterm(1). It has been observed in this thread that xterm always
prints SHY.

xterm on OpenBSD, local irssi
https://hacktheplanet.fi/shy/xterm-openbsd-local.png
xterm on Debian, local irssi
https://hacktheplanet.fi/shy/xterm-debian-local.png

On OpenBSD, irssi thinks that the entire line fits into the 80 columns
available. But because xterm prints SHYs, the line overflows onto the next and
is promptly overwritten by the next line that irssi puts there (the motd).
And finally ssh with xterm: xterm on OpenBSD, irssi on Debian https://hacktheplanet.fi/shy/xterm-openbsd-ssh-debian.png xterm on Debian, irssi on OpenBSD https://hacktheplanet.fi/shy/xterm-debian-ssh-openbsd.png This isn't the best example: there are many different problems that can arise from the width calculation discrepancy - some of them can be more spectacular I think, but I could only come up with this one on demand. Despite the bad example, I do consider cases where text messes up in ways the application did not intend (in the worst case, overwriting other text) on the same terminal on different operating systems interoperability bugs. In this case the outputs are different due to interactions between systems that use wcwidth(SHY) = 1 (such as, apparently, xterm even locally) and OpenBSD. I might not say it is "operating in a severely erroneous way", but then I don't consider "severely erroneous" as a requirement to fix issues. > If you want to show a hyphen in your text, use a hyphen. If you want to > indicate where a word might be broke
Re: wcwidth of soft hyphen
On Tue, 2021-04-06 at 13:27 +0100, Stuart Henderson wrote: > On 2021/04/06 13:09, Martijn van Duren wrote: > > I´m also not convinced that the other wcwidth implementations might be > > on to something and that the unicode consortium is having inertia > > problems. > > The difficulty is that it isn't *possible* to give a single correct > answer for the width of SHY, it varies and can only be identified > when other information about the terminal is taken into account (how > the terminal behaves and whether the word currently printed is being > wrapped), which is out of scope for wcwidth(3). So no surprise > different people come up with a different way to handle it. My statement is that we have xterm in UTF-8 mode and we only support ASCII/UTF-8 in base. So we should use the unicode definitions. They state that a SHY should only be replaced by a hyphen on the end of the line and taking localized grammar rules into account. Since the shell never looks at ZWSP/SHY/whatever character for breaking up a word over multiple lines it should *never* be visible on the shell making our definition of 0 width always correct. If an application uses it to break a word over two lines it needs to take the local grammar into account, potentially changing the surrounding characters. In that case the application only uses it as an indicator of the hyphenated breakup and should place an actual hyphen there itself, making the SHY still only an invisible indicator with width 0. > > > If you want to show a hyphen in your text, use a hyphen. If you want to > > indicate where a word might be broken up in a hyphenated way across two > > lines if the software knows the localized grammar rules use a SHY. > > Also thanks to sthen@ for pointing out where the confusion comes from: > > we´re using UTF-8 here, not ISO-8859-1, so we must make sure that we > > use the UTF-8 definitions. > > but, guess what happens when text is converted from ISO-8859-1 to UTF-8... 
>
> $ printf '\xad' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
> c2 ad |..|
>
If ISO-8859-1 SHY has no one-to-one counterpart in Unicode I'd probably choose
the same conversion. That doesn't make them equal, just a close enough
approximation for automated tooling.
Re: wcwidth of soft hyphen
On 2021/04/06 13:09, Martijn van Duren wrote:
> I'm also not convinced that the other wcwidth implementations might be
> on to something and that the unicode consortium is having inertia
> problems.

The difficulty is that it isn't *possible* to give a single correct
answer for the width of SHY: it varies and can only be identified
when other information about the terminal is taken into account (how
the terminal behaves and whether the word currently printed is being
wrapped), which is out of scope for wcwidth(3). So it's no surprise that
different people come up with different ways to handle it.

> If you want to show a hyphen in your text, use a hyphen. If you want to
> indicate where a word might be broken up in a hyphenated way across two
> lines if the software knows the localized grammar rules use a SHY.
> Also thanks to sthen@ for pointing out where the confusion comes from:
> we're using UTF-8 here, not ISO-8859-1, so we must make sure that we
> use the UTF-8 definitions.

but, guess what happens when text is converted from ISO-8859-1 to UTF-8...

$ printf '\xad' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
c2 ad |..|
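The c2 ad result above follows directly from the UTF-8 encoding rules, since ISO-8859-1 bytes map to the Unicode code points of the same value. A minimal sketch of that conversion for a single byte, where latin1_to_utf8 is a hypothetical helper name and not part of iconv:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one ISO-8859-1 byte as UTF-8.
 * Writes 1 or 2 bytes to out and returns how many were written.
 */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    assert(out != NULL);
    if (c < 0x80) {
        /* ASCII range is unchanged. */
        out[0] = c;
        return 1;
    }
    /* U+0080..U+00FF: 110000xx 10xxxxxx */
    out[0] = (unsigned char) (0xc0 | (c >> 6));
    out[1] = (unsigned char) (0x80 | (c & 0x3f));
    return 2;
}
```

For 0xad the top two bits give 0xc0 | 0x02 = 0xc2 and the low six bits give 0x80 | 0x2d = 0xad, which matches the hexdump output.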
Re: wcwidth of soft hyphen
On Mon, 2021-04-05 at 20:30 +0200, Ingo Schwarze wrote:
> Hi,
>
> Martijn van Duren wrote on Thu, Apr 01, 2021 at 09:30:36AM +0200:
> > So going by this phrase the character should not be printed
>
> When formatting a document, for example for printing on paper or
> the online equivalent like PostScript or PDF, i agree. But i
> strongly prefer the terminal to always display this character because
> the terminal's usual purpose is not nice text formatting for visual
> consumption. It should usually show the full content of strings
> or files, be it for inspection or for editing. Omitting characters
> in such contexts sets nasty traps for the person working with the
> terminal.
>
> So i say nothing should be changed at all in OpenBSD.
>
> Yes, that means column counting is wrong on the terminal, but that's
> a very minor problem, if it's a problem at all, compared to the havoc
> that could result from not showing the character on the terminal at
> all, and it cannot be fixed without causing worse problems in situations
> that matter more.

I disagree with you here. As sthen@ just pointed out, this is most likely
legacy printing behaviour from ISO-8859-1, which uses a different definition
of SHY.
Saying that not showing a character on the terminal at all can cause havoc
also has other implications: we would have to start printing ZWSP and
make a stronger distinction between tab and space. And those are just a few
examples off the top of my head. If you want to see the actual text you're
working with, you need something like vis(1), hexdump(1), or something more
sophisticated for UTF-8.

We claim we support UTF-8, so we should use the Unicode Consortium
definitions, especially if they make linguistic sense, which they do.

> The bug in NetBSD and Linux should be fixed, but that's off-topic here.

And I'd like to add terminals in Unicode mode to that list.

> Yours,
>   Ingo

martijn@
Re: wcwidth of soft hyphen
On Thu, 2021-04-01 at 10:39 +0300, Lauri Tirkkonen wrote:
> On Thu, Apr 01 2021 09:30:36 +0200, Martijn van Duren wrote:
> > However, based on the description by the Unicode Consortium I think
> > OpenBSD does the right thing and xterm and others should be fixed,
>
> practically, I doubt this will happen. I don't think the glibc people will be
> convinced to break compatibility to their older versions, for example. I
> explicitly mentioned I don't wish to engage in a discussion about which way is
> _correct_ - I am interested in interoperability with real, existing systems.

I'm not convinced that you've shown that it's actually an
interoperability issue. In your last mail you state that it's a simple
display difference between tmux and raw xterm on OpenBSD. To me that's
similar to most Linux distros having grep aliased to
grep --color=auto by default and stating that we should do the same
because you like pretty colours. What applications fail to operate, or
operate in a severely erroneous way, because of this discrepancy?

I'm also not convinced that the other wcwidth implementations might be
on to something and that the Unicode Consortium is having inertia
problems. In my previous mail I quoted the linguistic constructs on which
the character is based and that it is invisible. To stick with their example:
I write "opaatje" or "opa-" LF "tje", not "opaa-tje".

If you want to show a hyphen in your text, use a hyphen. If you want to
indicate where a word might be broken up in a hyphenated way across two
lines, provided the software knows the localized grammar rules, use a SHY.
Also thanks to sthen@ for pointing out where the confusion comes from:
we're using UTF-8 here, not ISO-8859-1, so we must make sure that we
use the UTF-8 definitions.

martijn@
Re: wcwidth of soft hyphen
On 2021/04/05 12:45, Theo de Raadt wrote:
> So, your argument is that displays should remain broken forever.
>
> > The bug in NetBSD and Linux should be fixed, but that's off-topic here.
>
> If you cannot explain how this problem is going to be fixed (reversed)
> in these opposing ecosystems (and it is not just Linux and NetBSD), then
> you've closed the argument with a cop-out.
>
> It cannot be off-topic.
>
> It is an interop problem which must be settled.
>
> From time to time, de facto standards arise which have inertia that
> is too great to fight.
>
> Your position seems to me that original standards are etched in stone and
> it is impossible to have new de facto standards arise, and if interop
> issues arrive, screw everyone -- can't they see there is a stone?

Some terminal emulators are using iso-8859-1 semantics of soft hyphen;
Unicode did things differently but those terminals haven't changed.

xterm            printed as hyphen
mlterm           printed as hyphen
putty            printed as hyphen
urxvt            overprinted on previous character
st               not printed, no space
kitty            not printed, no space
cool-retro-term  not printed, no space
sakura           printed as space

There's some more about this on https://jkorpela.fi/shy.html, it's all
a mess.

Pragmatically the simplest fix for the original problem might be if
irssi filtered out soft-hyphen characters like mutt does in its
"is_display_corrupting_utf8()" function:

https://gitlab.com/muttmua/mutt/-/blob/master/mbyte.c#L528
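An application-side filter of the kind suggested above could look roughly like this. It is a sketch only and not mutt's is_display_corrupting_utf8() code; strip_soft_hyphens is a hypothetical helper that assumes its input is valid UTF-8, where U+00AD is always the byte pair 0xC2 0xAD.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Remove every SOFT HYPHEN (U+00AD, UTF-8 bytes 0xC2 0xAD) from the
 * NUL-terminated string s, in place. Returns the new length in bytes.
 * Assumes s holds valid UTF-8, so 0xC2 can only be a lead byte.
 */
size_t strip_soft_hyphens(char *s)
{
    const unsigned char *in;
    unsigned char *out;

    assert(s != NULL);
    in = (const unsigned char *) s;
    out = (unsigned char *) s;
    while (*in != '\0') {
        if (in[0] == 0xc2 && in[1] == 0xad) {
            in += 2; /* drop the two-byte SHY sequence */
            continue;
        }
        *out++ = *in++;
    }
    *out = '\0';
    return (size_t) (out - (unsigned char *) s);
}
```

With SHY removed before any width calculation, the application and the terminal no longer need to agree on wcwidth(SHY), at the cost of every application carrying its own filter.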
Re: wcwidth of soft hyphen
Hi Ingo, On Mon, Apr 05 2021 20:30:39 +0200, Ingo Schwarze wrote: > Whether all control chars are always width 0 can maybe also be > disputed. Again, the stronger argument seems to me that they are. > If they weren't, they would not be control characters but alphanumeric, > punctuation, spaces, or special printable characters, none of which > they are. I say width 1 and 2 require standalone glyphs that are > normally used for the character. Besides, no operating system > correctly identifies this as a control character and yet gives it > width 1. I agree with your assessments about iswcntrl and iswprint. My original patch proposed to keep those return values as is, and only change the wcwidth() from 0 to 1. > I insist that the discussion should remain very strictly formal, > about the properties and classification in the Unicode data files > and nothing else. If people start arguing about what makes sense > for any particular character, that's already an argument going > astray. In general I agree, but I contend that SHY is, unfortunately, a little bit special. This confusion about its printability and/or column width is definitely not unique to OpenBSD. > > So going by this phrase the character should not be printed > > When formatting a document, for example for printing on paper or > the online equivalent like PostScript or PDF, i agree. But i > strongly prefer the terminal to always display this character because > the terminal's usual purpose is not nice text formatting for visual > consumption. It should usually show the full content of strings > or files, be it for inspection or for editing. Omitting characters > in such contexts sets nasty traps for the person working with the > terminal. I agree with this completely - you said it better than I could have. 
This is another reason why I think it makes sense for this character to have wcwidth() of 1 - applications that are "SHY-aware" can print (or not) the soft hyphen however they wish, but terminal software seems to almost always ask wcwidth() to figure out the column width. Indeed, terminal software is where I ran into the problem of SHY sometimes being invisible. > So i say nothing should be changed at all in OpenBSD. > > Yes, that means column counting is wrong on the terminal, but that's > a very minor problem, if it's a problem at all, compared to the havoc > that could result from not showing the character on the terminal at > all, and it cannot be fixed without causing worse problems in situations > that matter more. Right, I am not at all advocating to hide the SHY on the terminal - quite the contrary, I want to make its width consistent. The current situation, with SHY having a wcwidth() of 0 causes, for example, the following discrepancy between xterm(1) (on the left) and tmux(1) in xterm(1) (on the right): https://hacktheplanet.fi/shytmux.png -- in tmux, the SHY is not visible. The other issues I observed with discrepancies between OpenBSD and other systems I already outlined in my initial mail. -- Lauri Tirkkonen | lotheac @ IRCnet
Re: wcwidth of soft hyphen
So, your argument is that displays should remain broken forever.

> The bug in NetBSD and Linux should be fixed, but that's off-topic here.

If you cannot explain how this problem is going to be fixed (reversed)
in these opposing ecosystems (and it is not just Linux and NetBSD), then
you've closed the argument with a cop-out.

It cannot be off-topic.

It is an interop problem which must be settled.

From time to time, de facto standards arise which have inertia that
is too great to fight.

Your position seems to me that original standards are etched in stone and
it is impossible to have new de facto standards arise, and if interop
issues arrive, screw everyone -- can't they see there is a stone?

The position statement translates simply to: It must remain broken.

Ingo Schwarze wrote:

> Hi,
>
> Martijn van Duren wrote on Thu, Apr 01, 2021 at 09:30:36AM +0200:
>
> > When it comes to these discussions I prefer to go back to the standards
>
> I would propose an even more rigorous stance: not only go back to
> the standards, but use whatever the Unicode data files (indirectly,
> via the Perl modules) parsed by gen_ctype_utf8.pl specify. Manually
> changing properties of individual characters should be restricted
> to very rare cases of crystal clear, absolutely unambiguous errors.
> When there is the slightest doubt or when there are arguments both
> ways, follow the Unicode data files and how Perl interprets them.
>
> We have
>
>   iswcntrl = 1  because UnicodeData.txt has class Cf (format control char)
>   iswprint = 1  because the class is neither Cc nor Cs
>   wcwidth  = 0  because the class starts with C (control char)
>
> This is also neither obviously nor unambiguously wrong, so it should
> not be changed.
>
> The choice of iswcntrl = 1 is most definitely correct because
> that's what class Cf says, there can be no doubt about that at all.
> Consequently, NetBSD, glibc, and musl are definitely buggy in so far
> as they return iswcntrl = 0.
> > Whether class Cf is always printable is maybe not absolutely clear. > There are arguments both ways. The stronger argument seems to be > that these format control chars usually appear in the middle of > printable characters and they are printed together with the > surrounding characters. But maybe the FreeBSD argument that > some of them are sometimes not ptinted and hence iswprint = 0 > can also be made, though somewhat dubiously because sometimes > they are printed. Besides, which property would you use for > deciding printability? Please, don't resort to deciding that > character-by-character. > > Whether all control chars are always width 0 can maybe also be > disputed. Again, the stronger argument seems to me that they are. > If they weren't, they would not be control characters but alphanumeric, > punctuation, spaces, or special printable characters, none of which > they are. I say width 1 and 2 require standalone glyphs that are > normally used for the character. Besides, no operating system > correctly identifies this as a control character and yet gives it > width 1. > > I insist that the discussion should remain very strictly formal, > about the properties and classification in the Unicode data files > and nothing else. If people start arguing about what makes sense > for any particular character, that's already an argument going > astray. > > > > So going by this phrase the character should not be printed > > When formatting a document, for example for printing on paper or > the online equivalent like PostScript or PDF, i agree. But i > strongly prefer the terminal to always display this character because > the terminal's usual purpose is not nice text formatting for visual > consumption. It should usually show the full content of strings > or files, be it for inspection or for editing. Omitting characters > in such contexts sets nasty traps for the person working with the > terminal. > > So i say nothing should be changed at all in OpenBSD. 
> > Yes, that means column counting is wrong on the terminal, but that's > a very minor problem, if it's a problem at all, compared to the havoc > that could result from not showing the character on the terminal at > all, and it cannot be fixed without causing worse problems in situations > that matter more. > > The bug in NetBSD and Linux should be fixed, but that's off-topic here. > > Yours, > Ingo >
Re: wcwidth of soft hyphen
Hi,

Martijn van Duren wrote on Thu, Apr 01, 2021 at 09:30:36AM +0200:

> When it comes to these discussions I prefer to go back to the standards

I would propose an even more rigorous stance: not only go back to the standards, but use whatever the Unicode data files (indirectly, via the Perl modules) parsed by gen_ctype_utf8.pl specify. Manually changing properties of individual characters should be restricted to very rare cases of crystal clear, absolutely unambiguous errors. When there is the slightest doubt or when there are arguments both ways, follow the Unicode data files and how Perl interprets them.

We have

  iswcntrl = 1 because UnicodeData.txt has class Cf (format control char)
  iswprint = 1 because the class is neither Cc nor Cs
  wcwidth = 0 because the class starts with C (control char)

This is also neither obviously nor unambiguously wrong, so it should not be changed.

The choice of iswcntrl = 1 is most definitely correct because that's what class Cf says, there can be no doubt about that at all. Consequently, NetBSD, glibc, and musl are definitely buggy in so far as they return iswcntrl = 0.

Whether class Cf is always printable is maybe not absolutely clear. There are arguments both ways. The stronger argument seems to be that these format control chars usually appear in the middle of printable characters and they are printed together with the surrounding characters. But maybe the FreeBSD argument that some of them are sometimes not printed and hence iswprint = 0 can also be made, though somewhat dubiously because sometimes they are printed. Besides, which property would you use for deciding printability? Please, don't resort to deciding that character-by-character.

Whether all control chars are always width 0 can maybe also be disputed. Again, the stronger argument seems to me that they are. If they weren't, they would not be control characters but alphanumeric, punctuation, spaces, or special printable characters, none of which they are.
I say width 1 and 2 require standalone glyphs that are normally used for the character. Besides, no operating system correctly identifies this as a control character and yet gives it width 1. I insist that the discussion should remain very strictly formal, about the properties and classification in the Unicode data files and nothing else. If people start arguing about what makes sense for any particular character, that's already an argument going astray. > So going by this phrase the character should not be printed When formatting a document, for example for printing on paper or the online equivalent like PostScript or PDF, i agree. But i strongly prefer the terminal to always display this character because the terminal's usual purpose is not nice text formatting for visual consumption. It should usually show the full content of strings or files, be it for inspection or for editing. Omitting characters in such contexts sets nasty traps for the person working with the terminal. So i say nothing should be changed at all in OpenBSD. Yes, that means column counting is wrong on the terminal, but that's a very minor problem, if it's a problem at all, compared to the havoc that could result from not showing the character on the terminal at all, and it cannot be fixed without causing worse problems in situations that matter more. The bug in NetBSD and Linux should be fixed, but that's off-topic here. Yours, Ingo
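The three rules listed in the message above can be sketched in Python roughly as follows. This is only an illustration of the reasoning, not the logic of gen_ctype_utf8.pl: the function name `classify` is hypothetical, the general category comes from Python's own `unicodedata` tables rather than the Perl modules, and the choice to treat exactly Cc and Cf as control is an assumption made for the sketch. Widths for classes the message does not discuss are deliberately left unmodelled.

```python
import unicodedata

SHY = "\u00ad"  # SOFT HYPHEN


def classify(ch):
    """Apply the three rules quoted above to a single character.

    Hypothetical helper for illustration; it is not gen_ctype_utf8.pl.
    """
    cat = unicodedata.category(ch)  # general category, e.g. "Cf"
    printable = cat not in ("Cc", "Cs")
    return {
        "category": cat,
        # Assumption: only Cc and Cf count as control in this sketch.
        "iswcntrl": 1 if cat in ("Cc", "Cf") else 0,
        # "iswprint = 1 because the class is neither Cc nor Cs"
        "iswprint": 1 if printable else 0,
        # "wcwidth = 0 because the class starts with C"; anything else
        # is outside the scope of this sketch and yields None.
        "wcwidth": 0 if (printable and cat.startswith("C")) else None,
    }


print(classify(SHY))
```

For U+00AD this yields category Cf with iswcntrl = 1, iswprint = 1 and wcwidth = 0, matching the values reported for OpenBSD earlier in the thread.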
Re: wcwidth of soft hyphen
On Thu, Apr 01 2021 09:30:36 +0200, Martijn van Duren wrote: > However, based on the description by the Unicode Consortium I think > OpenBSD does the right thing and xterm and others should be fixed, practically, I doubt this will happen. I don't think the glibc people will be convinced to break compatibility to their older versions, for example. I explicitly mentioned I don't wish to engage in a discussion about which way is _correct_ - I am interested in interoperability with real, existing systems. -- Lauri Tirkkonen | lotheac @ IRCnet
Re: wcwidth of soft hyphen
When it comes to these discussions I prefer to go back to the standards rather than just look at the surrounding discussions. The standard[0] states the following in section 23.2:

    Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break point, where a line break is preferred if a word must be hyphenated or otherwise broken across lines. Such break points are generally determined by an automatic hyphenator. SHY can be used with any script, but its use is generally limited to situations where users need to override the behavior of such a hyphenator. The visible rendering of a line break at an intraword break point, whether automatically determined or indicated by a SHY, depends on the surrounding characters, the rules governing the script and language used, and, at times, the meaning of the word. The precise rules are outside the scope of this standard, but see Unicode Standard Annex #14, "Unicode Line Breaking Algorithm," for additional information. A common default rendering is to insert a hyphen before the line break, but this is insufficient or even incorrect in many situations.

Annex #14 section 5.4[1] begins with:

    Unlike U+2010 HYPHEN, which always has a visible rendition, the character U+00AD SOFT HYPHEN (SHY) is an invisible format character that merely indicates a preferred intraword line break position ... Depending on the language and the word, that may produce different visible results[2]

So going by this phrase the character should not be printed and have no influence on the text if it's not used as a linebreak.

The problem arises in how the terminal handles this character. In the case of xterm it appears to always print the character (printf "\302\255"), which according to Annex #14 is wrong. If you were to use another terminal which honours this guideline, OpenBSD would be correct and glibc etc. would be wrong.
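For reference, the octal escapes in the printf example above are simply the UTF-8 encoding of U+00AD. A small Python sketch (not from the thread; the variable names are made up for illustration) to check that:

```python
SHY = "\u00ad"  # SOFT HYPHEN

# UTF-8 encoding of U+00AD: two bytes, 0xC2 0xAD.
encoded = SHY.encode("utf-8")

# Render each byte as a three-digit octal escape, as printf(1) expects.
octal = "".join(f"\\{byte:03o}" for byte in encoded)

print(encoded.hex(), octal)
```

The octal string it builds is `\302\255`, the same byte sequence passed to printf in the test above.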
There's also something to be said for the way FreeBSD handles it, but that would break things even more in some OpenBSD applications, like ls(1), where a wcwidth of -1 would print a '?', which is even worse. Maybe this should be revisited so that such characters are skipped completely, but that's outside the scope of this discussion.

In conclusion: As long as the output device isn't the database used to determine how things are displayed there's no 100% guarantee that the software calculating the column width is doing the right thing. However, based on the description by the Unicode Consortium I think OpenBSD does the right thing and xterm and others should be fixed, especially if they just do a dumb printing of the characters without taking the proper line breaking rules into account and just keep on printing until the end of the screen and then continue on the next line. This goes double if the printing of the hyphen must cause visible changes (like spelling) according to the language rules.

martijn@

On Thu, 2021-04-01 at 08:27 +0300, Lauri Tirkkonen wrote:
> When using terminal software on non-OpenBSD to connect to my OpenBSD IRC machine, I noticed that sometimes the local terminal disagrees with the remote tmux and application (in this case, irssi) about the character width of some lines, causing different kinds of breakage.
> Those lines happened to contain soft hyphens (U+00AD), which behave as follows across a few different operating systems:
>
> OpenBSD-CURRENT:      iswprint(SHY) = 1  iswcntrl(SHY) = 1  wcwidth(SHY) = 0
> NetBSD 9.1:           iswprint(SHY) = 1  iswcntrl(SHY) = 0  wcwidth(SHY) = 1
> FreeBSD 12.2:         iswprint(SHY) = 0  iswcntrl(SHY) = 1  wcwidth(SHY) = -1
> glibc (Debian sid):   iswprint(SHY) = 1  iswcntrl(SHY) = 0  wcwidth(SHY) = 1
> musl (Alpine 3.13.3): iswprint(SHY) = 1  iswcntrl(SHY) = 0  wcwidth(SHY) = 1
>
> On Windows, PowerShell, PuTTY and MinTTY (shipped with the default install of git from git-scm.com as part of MSYS2) render the soft hyphen as a visible character with a width of 1 column.
>
> The OpenBSD wcwidth(SHY) of 0 is what the problem comes down to (FreeBSD's return values are also strange, but this is an OpenBSD list). There is a lot of background discussion about whether or not Unicode intends the SHY to be printable or not, and whether it should have width of 0 or 1, in eg. [0] and [1], but for better or worse, it seems most other systems decided that SHY has a width of 1 and should be a visible character (at least in terminal contexts).
>
> Therefore, in the interest of interoperability, I propose the following diff to special-case SHY into having a width of 1. I don't intend to go down the rabbit hole of a discussion regarding what the 'correct' width is, but the discrepancy with other systems causes real problems, and I think those other systems made their decisions years ago (see eg. [0] for glibc).
>
> Diff below only for gen_ctype_utf8.pl; I am not including the resulting en_US.UTF-8.src diff, because it seems there is a Unicode 12.1.0 to 13.0.0 >