On Tue, 2021-04-06 at 13:27 +0100, Stuart Henderson wrote:
> On 2021/04/06 13:09, Martijn van Duren wrote:
> > I'm also not convinced that the other wcwidth implementations might be
> > on to something and that the unicode consortium is having inertia
> > problems.
> 
> The difficulty is that it isn't *possible* to give a single correct
> answer for the width of SHY, it varies and can only be identified
> when other information about the terminal is taken into account (how
> the terminal behaves and whether the word currently printed is being
> wrapped), which is out of scope for wcwidth(3). So no surprise
> different people come up with a different way to handle it.

My point is that we have xterm in UTF-8 mode and we only support
ASCII/UTF-8 in base, so we should use the Unicode definitions. They
state that a SHY should only be replaced by a hyphen at the end of a
line, taking localized grammar rules into account.
Since the shell never looks at ZWSP/SHY/whatever character for breaking
up a word over multiple lines, it should *never* be visible in the
shell, making our definition of width 0 always correct. If an
application uses it to break a word over two lines, it needs to take
the local grammar into account, potentially changing the surrounding
characters. In that case the application only uses it as an indicator
of the hyphenation point and should place an actual hyphen there
itself, so the SHY is still only an invisible indicator with width 0.
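
To illustrate the application-side behaviour described above, here is a
minimal Python sketch. It is only an illustration under stated
assumptions: wrap_word() and display() are hypothetical helper names,
every remaining character is assumed to occupy one column, and
language-specific spelling changes around the break are ignored.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def display(word):
    """Unbroken word as shown on screen: the SHY contributes no columns."""
    return word.replace(SHY, "")


def wrap_word(word, room):
    """Break `word` at a SHY so the head plus a real hyphen fits in `room`.

    Returns (head, tail). `head` is "" when no break point fits.
    Assumption: one column per character once the SHYs are removed.
    """
    parts = word.split(SHY)
    # Try the longest possible head first, then shorter ones.
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i])
        if len(head) + 1 <= room:  # +1 column for the hyphen we insert
            return head + "-", "".join(parts[i:])
    return "", "".join(parts)


if __name__ == "__main__":
    word = "hy" + SHY + "phen" + SHY + "ation"
    print(display(word))
    print(wrap_word(word, 7))
```

In both paths the SHY itself is never printed: either it is dropped, or
the application writes a real hyphen in its place.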
> 
> > If you want to show a hyphen in your text, use a hyphen. If you want to
> > indicate where a word might be broken up in a hyphenated way across two
> > lines, if the software knows the localized grammar rules, use a SHY.
> > Also thanks to sthen@ for pointing out where the confusion comes from:
> > we're using UTF-8 here, not ISO-8859-1, so we must make sure that we
> > use the UTF-8 definitions.
> 
> but, guess what happens when text is converted from ISO-8859-1 to UTF-8...
> 
> $ printf '\xad' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
> 00000000  c2 ad                                             |..|
> 
If the ISO-8859-1 SHY has no one-to-one counterpart in Unicode, I'd
probably choose the same conversion. That doesn't make them equal, just
a close enough approximation for automated tooling.
