When it comes to these discussions I prefer to go back to the standards
rather than just look at the surrounding discussions.
The standard[0] states the following in section 23.2:
Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break
point, where a line break is preferred if a word must be hyphenated or
otherwise broken across lines. Such break points are generally determined
by an automatic hyphenator. SHY can be used with any script, but its
use is generally limited to situations where users need to override the
behavior of such a hyphenator. The visible rendering of a line break at
an intraword break point, whether automatically determined or indicated
by a SHY, depends on the surrounding characters, the rules governing the
script and language used, and, at times, the meaning of the word. The
precise rules are outside the scope of this standard, but see Unicode
Standard Annex #14, "Unicode Line Breaking Algorithm," for additional
information. A common default rendering is to insert a hyphen before the
line break, but this is insufficient or even incorrect in many
situations.

Annex #14 section 5.4[1] begins with:
Unlike U+2010 HYPHEN, which always has a visible rendition, the
character U+00AD SOFT HYPHEN (SHY) is an invisible format character that
merely indicates a preferred intraword line break position
...
Depending on the language and the word, that may produce different visible
results[2]

So going by this phrase, the character should not be printed and should
have no influence on the text if it's not used as a line break. The problem
lies in how the terminal handles this character. xterm appears to always
print the character (printf "\302\255"), which according to Annex #14 is
wrong. If you were to use another terminal which honours this guideline,
OpenBSD would be correct and glibc etc. would be wrong.
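
To make the disagreement concrete, here is a minimal Python sketch (not
part of the original discussion). The columns() helper and the sample word
are made up for illustration, and every character other than SHY is
assumed to be exactly one column wide, which is a simplification.

```python
import unicodedata

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def columns(text, shy_width):
    """Count terminal columns, treating SHY as shy_width columns wide.

    Simplification: every other character is assumed to be one column.
    """
    return sum(shy_width if ch == SHY else 1 for ch in text)


# Hypothetical sample: a word with one soft hyphen in the middle.
word = "hyphen" + SHY + "ation"

print(SHY.encode("utf-8"))        # the two bytes behind printf "\302\255"
print(unicodedata.category(SHY))  # Unicode general category of SHY
print(columns(word, 0))           # count when SHY is treated as width 0
print(columns(word, 1))           # count when SHY is treated as width 1
```

The two counts differ by one column per soft hyphen, which is the kind of
mismatch that appears when the application and the terminal disagree.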

There's also something to be said for the way FreeBSD handles it, but that
would break things even more in some OpenBSD applications, like ls(1),
where a wcwidth of -1 would print a '?', which is even worse. Maybe
this should be revisited so that these characters are skipped completely,
but that's outside the scope of this discussion.
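
As a rough illustration of the two behaviours (substituting '?' versus
skipping), consider the Python sketch below. It is not how ls(1) is
implemented; display_name() and freebsd_like_width() are hypothetical
helpers, and the width function only models "SHY is -1, everything else
is 1".

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def freebsd_like_width(ch):
    # Assumption for illustration: only SHY is reported as non-printable
    # (-1); every other character counts as one column.
    return -1 if ch == SHY else 1


def display_name(name, width_of, skip_unprintable=False):
    """Return name as it would be shown, handling characters of width -1.

    With skip_unprintable=False such characters are replaced by '?';
    with skip_unprintable=True they are dropped from the output.
    """
    shown = []
    for ch in name:
        if width_of(ch) < 0:
            if not skip_unprintable:
                shown.append("?")
            continue
        shown.append(ch)
    return "".join(shown)


name = "co" + SHY + "op"  # hypothetical file name containing a SHY
print(display_name(name, freebsd_like_width))
print(display_name(name, freebsd_like_width, skip_unprintable=True))
```

The first call substitutes a '?' for the soft hyphen, the second drops it
silently.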

In conclusion: As long as the output device isn't the database used to
determine how things are displayed, there's no 100% guarantee that the
software calculating the column width is doing the right thing.
However, based on the description by the Unicode Consortium, I think
OpenBSD does the right thing and xterm and others should be fixed,
especially if they just print the characters blindly, without taking the
proper line breaking rules into account, and keep on printing until the
end of the screen before continuing on the next line.
This goes double if printing the hyphen must cause visible changes
(like spelling) according to the language rules.

martijn@

On Thu, 2021-04-01 at 08:27 +0300, Lauri Tirkkonen wrote:
> When using terminal software on non-OpenBSD to connect to my OpenBSD IRC
> machine, I noticed that sometimes the local terminal disagrees with the remote
> tmux and application (in this case, irssi) about the character width of some
> lines, causing different kinds of breakage. Those lines happened to contain soft
> hyphens (U+00AD), which behave as follows across a few different operating
> systems:
> 
> OpenBSD-CURRENT:        iswprint(SHY) = 1 iswcntrl(SHY) = 1 wcwidth(SHY) = 0
> NetBSD 9.1:             iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
> FreeBSD 12.2:           iswprint(SHY) = 0 iswcntrl(SHY) = 1 wcwidth(SHY) = -1
> glibc (Debian sid):     iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
> musl (Alpine 3.13.3):   iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
> 
> On Windows, PowerShell, PuTTY and MinTTY (shipped with the default install of
> git from git-scm.com as part of MSYS2) render the soft hyphen as a visible
> character with a width of 1 column.
> 
> The OpenBSD wcwidth(SHY) of 0 is what the problem comes down to (FreeBSD's
> return values are also strange, but this is an OpenBSD list). There is a lot of
> background discussion about whether or not Unicode intends the SHY to be
> printable or not, and whether it should have width of 0 or 1, in eg. [0] and
> [1], but for better or worse, it seems most other systems decided that SHY has a
> width of 1 and should be a visible character (at least in terminal contexts).
> 
> Therefore, in the interest of interoperability, I propose the following diff to
> special-case SHY into having a width of 1. I don't intend to go down the rabbit
> hole of a discussion regarding what the 'correct' width is, but the discrepancy
> with other systems causes real problems, and I think those other systems made
> their decisions years ago (see eg. [0] for glibc).
> 
> Diff below only for gen_ctype_utf8.pl; I am not including the resulting
> en_US.UTF-8.src diff, because it seems there is a Unicode 12.1.0 to 13.0.0
> update that happens on regeneration of that file, and that is orthogonal to this
> change (essentially: [2], which has not been committed yet)
> 
> [0]: https://sourceware.org/bugzilla/show_bug.cgi?id=22073
> [1]: https://jkorpela.fi/shy.html
> [2]: https://marc.info/?l=openbsd-tech&m=161534047428793&w=2
> 
> diff --git a/share/locale/ctype/gen_ctype_utf8.pl b/share/locale/ctype/gen_ctype_utf8.pl
> index e23472efb2c..c593dc628ee 100755
> --- a/share/locale/ctype/gen_ctype_utf8.pl
> +++ b/share/locale/ctype/gen_ctype_utf8.pl
> @@ -404,6 +404,9 @@ sub codepoint_columns
>  
>         # Several fonts provide glyphs in this range
>         return 1 if $code >= 0xe000 and $code <= 0xf8ff;
> +       # Soft hyphen (SHY) is in category Cf, which implies width 0, but since
> +       # it is width 1 in nearly every other environment, set it here.
> +       return 1 if $code == 0x00ad;
>  
>         return 0 if $charinfo->{category} eq 'Mn';
>         return 0 if $charinfo->{category} eq 'Me';
> 
[0] https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf
[1] https://www.unicode.org/reports/tr14/tr14-45.html#SoftHyphen
[2] There's more nuance that must be looked at before jumping to
    conclusions, but that would be overkill for this mail.
