When it comes to these discussions I prefer to go back to the standards
rather than just look at the surrounding discussions.

The standard[0] states the following in section 23.2:

    Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break
    point, where a line break is preferred if a word must be hyphenated
    or otherwise broken across lines. Such break points are generally
    determined by an automatic hyphenator. SHY can be used with any
    script, but its use is generally limited to situations where users
    need to override the behavior of such a hyphenator. The visible
    rendering of a line break at an intraword break point, whether
    automatically determined or indicated by a SHY, depends on the
    surrounding characters, the rules governing the script and language
    used, and, at times, the meaning of the word. The precise rules are
    outside the scope of this standard, but see Unicode Standard
    Annex #14, "Unicode Line Breaking Algorithm," for additional
    information. A common default rendering is to insert a hyphen before
    the line break, but this is insufficient or even incorrect in many
    situations.

Annex #14 section 5.4[1] begins with:

    Unlike U+2010 HYPHEN, which always has a visible rendition, the
    character U+00AD SOFT HYPHEN (SHY) is an invisible format character
    that merely indicates a preferred intraword line break position ...
    Depending on the language and the word, that may produce different
    visible results[2]

So going by this text, the character should not be printed and should
have no influence on the text unless a line is actually broken at it.

The problem arises in how the terminal handles this character. xterm
appears to always print the character (try printf "\302\255"), which is
wrong according to Annex #14. If you were to use another terminal which
honours this guideline, OpenBSD would be correct and glibc etc. would be
wrong.

There's also something to be said for the way FreeBSD handles it, but
that would break things even more in some OpenBSD applications, like
ls(1), where a wcwidth of -1 would print a '?', which is even worse.
Maybe this should be revisited so that these characters are skipped
completely, but that's outside the scope of this discussion.

In conclusion: as long as the output device isn't itself the database
used to determine how things are displayed, there's no 100% guarantee
that the software calculating the column width is doing the right thing.
However, based on the description by the Unicode Consortium I think
OpenBSD does the right thing and xterm and others should be fixed,
especially if they just print the characters blindly, without taking the
proper line breaking rules into account, and keep printing until the
edge of the screen before continuing on the next line. This goes double
if printing the hyphen must cause visible changes (such as spelling
changes) according to the language rules.
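
To make the disagreement concrete, here is a minimal sketch in C. It is
not taken from OpenBSD, xterm or any libc: shy_columns() is a
hypothetical helper, and it assumes valid UTF-8 input in which every code
point other than U+00AD is one column wide. It only shows how the column
count of a line changes with the width assigned to SHY.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: count the columns a UTF-8
 * string occupies when U+00AD (bytes \302\255) is given width shy_width.
 * Every other code point is assumed to be one column wide; real code
 * would ask wcwidth(3) for each character instead.
 */
int
shy_columns(const char *s, int shy_width)
{
	const unsigned char *p = (const unsigned char *)s;
	int cols = 0;

	assert(s != NULL);
	while (*p != '\0') {
		if (p[0] == 0xC2 && p[1] == 0xAD) {
			/* Soft hyphen: apply the chosen width policy. */
			cols += shy_width;
			p += 2;
		} else {
			/* Count lead bytes, skip continuation bytes. */
			if ((*p & 0xC0) != 0x80)
				cols++;
			p++;
		}
	}
	return cols;
}
```

For the string "co\302\255operate", shy_columns(s, 0) gives 9, which is
what an application using a wcwidth of 0 would budget, while
shy_columns(s, 1) gives 10, which is what a terminal drawing SHY in one
cell consumes. Each soft hyphen therefore shifts the rest of the line by
one column between the two views.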

martijn@

On Thu, 2021-04-01 at 08:27 +0300, Lauri Tirkkonen wrote:
> When using terminal software on non-OpenBSD to connect to my OpenBSD IRC
> machine, I noticed that sometimes the local terminal disagrees with the
> remote tmux and application (in this case, irssi) about the character
> width of some lines, causing different kinds of breakage. Those lines
> happened to contain soft hyphens (U+00AD), which behave as follows across
> a few different operating systems:
>
> OpenBSD-CURRENT:      iswprint(SHY) = 1  iswcntrl(SHY) = 1  wcwidth(SHY) = 0
> NetBSD 9.1:           iswprint(SHY) = 1  iswcntrl(SHY) = 0  wcwidth(SHY) = 1
> FreeBSD 12.2:         iswprint(SHY) = 0  iswcntrl(SHY) = 1  wcwidth(SHY) = -1
> glibc (Debian sid):   iswprint(SHY) = 1  iswcntrl(SHY) = 0  wcwidth(SHY) = 1
> musl (Alpine 3.13.3): iswprint(SHY) = 1  iswcntrl(SHY) = 0  wcwidth(SHY) = 1
>
> On Windows, PowerShell, PuTTY and MinTTY (shipped with the default install
> of git from git-scm.com as part of MSYS2) render the soft hyphen as a
> visible character with a width of 1 column.
>
> The OpenBSD wcwidth(SHY) of 0 is what the problem comes down to (FreeBSD's
> return values are also strange, but this is an OpenBSD list). There is a
> lot of background discussion about whether or not Unicode intends the SHY
> to be printable or not, and whether it should have width of 0 or 1, in eg.
> [0] and [1], but for better or worse, it seems most other systems decided
> that SHY has a width of 1 and should be a visible character (at least in
> terminal contexts).
>
> Therefore, in the interest of interoperability, I propose the following
> diff to special-case SHY into having a width of 1. I don't intend to go
> down the rabbit hole of a discussion regarding what the 'correct' width
> is, but the discrepancy with other systems causes real problems, and I
> think those other systems made their decisions years ago (see eg. [0] for
> glibc).
>
> Diff below only for gen_ctype_utf8.pl; I am not including the resulting
> en_US.UTF-8.src diff, because it seems there is a Unicode 12.1.0 to 13.0.0
> update that happens on regeneration of that file, and that is orthogonal
> to this change (essentially: [2], which has not been committed yet)
>
> [0]: https://sourceware.org/bugzilla/show_bug.cgi?id=22073
> [1]: https://jkorpela.fi/shy.html
> [2]: https://marc.info/?l=openbsd-tech&m=161534047428793&w=2
>
> diff --git a/share/locale/ctype/gen_ctype_utf8.pl b/share/locale/ctype/gen_ctype_utf8.pl
> index e23472efb2c..c593dc628ee 100755
> --- a/share/locale/ctype/gen_ctype_utf8.pl
> +++ b/share/locale/ctype/gen_ctype_utf8.pl
> @@ -404,6 +404,9 @@ sub codepoint_columns
>
>  	# Several fonts provide glyphs in this range
>  	return 1 if $code >= 0xe000 and $code <= 0xf8ff;
> +	# Soft hyphen (SHY) is in category Cf, which implies width 0, but since
> +	# it is width 1 in nearly every other environment, set it here.
> +	return 1 if $code == 0x00ad;
>
>  	return 0 if $charinfo->{category} eq 'Mn';
>  	return 0 if $charinfo->{category} eq 'Me';
>

[0] https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf
[1] https://www.unicode.org/reports/tr14/tr14-45.html#SoftHyphen
[2] There's more nuance that must be looked at before jumping to
    conclusions. But that would be overkill for this mail.