Re: wcwidth of soft hyphen

2021-04-16 Thread Martijn van Duren
On Thu, 2021-04-15 at 14:20 +0200, Martijn van Duren wrote:
> I did some archeology today and found that it used to behave as
> non-printable, but it got broken in release 334 (august 2018), when
> CharWidth was introduced. Before that my_wcwidth was used directly.
> 
> Since there doesn't appear to be a repository with commit messages I'm
> not 100% sure why this macro was introduced. My best guess at this
> point would be the following line from the xterm.log.html:
> several minor performance improvements using macros, e.g., inline checks
> for character width.
> Which would imply that it is indeed a bug in xterm.
> I mailed Thomas Dickey to ask his view on the situation and maybe get
> some context. Answer pending.

Answer received: It's as I described it and fixing it is on his todo
list.
> 
> On Wed, 2021-04-14 at 21:25 +0200, Martijn van Duren wrote:
> > On Wed, 2021-04-14 at 20:10 +0300, Lauri Tirkkonen wrote:
> > > Since the discussion seems to have died out, I take my patch will not be
> > > accepted.
> > > 
> > > The decision appears to be that OpenBSD is right and everyone else is 
> > > wrong in
> > > this matter. Given that, and the calls to change the behavior of other 
> > > OSes and
> > > terminal emulators around SHY: are you going to at least patch xterm 
> > > in-tree so
> > > that it does not render SHY?
> > > 
> > > Or must it remain broken?
> > > 
> > Looking closer at the xterm source corroborated my previous reasoning.
> > From xterm's wcwidth.c:
> > /*
> >  * Provide a way to change the behavior of soft-hyphen.
> >  */
> > void mk_wcwidth_init(int mode)
> > {
> >   use_latin1 = (mode == 0);
> > }
> > 
> > and
> > 
> >  *    - SOFT HYPHEN (U+00AD) has a column width of 1 in Latin-1, 0 in 
> > Unicode.
> >  *  An initialization function is used to switch between the two.
> > 
> > So it is the intention of xterm to not display the soft hyphen in
> > unicode mode.
> > 
> > This is also corrobarated by charproc.c5799 where the error occurs:
> >     if (ch == 0xad) {    
> >     /*  
> >     
> >  * Only display soft-hyphen if it happens to be 
> > at
> >  * the right-margin.  While that means that only
> >  * the displayed character could be selected for
> >  * pasting, a well-behaved application would 
> > never
> >  * send this, anyway...
> >  */
> > 
> > The problem here is that on line 5795 we have:
> > last_chomp = CharWidth(buf[n]);
> > which expands to:
> > CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) 
> > (n)))
> > and
> > #define IsLatin1(n)  (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= 
> > 255))
> > 
> > So here's the big oops: CharWidth doesn't know we're in UTF-8 mode and
> > we never reach my_wcwidth.
> > 
> > Diff below fixes this behaviour for me and restores the printing
> > behaviour when I run xterm with +u8 to reset utf-8 mode.
> > However, I'm no xterm hacker and it's quite a beast, so this needs
> > proper testing and scrutiny from someone who knows the code to make
> > sure there's no use of uninitialized variables. (CC matthieu@)
> > 
> > No intention of pushing this for 6.9, but maybe someone brave is
> > willing to dive in here after me.
> > 
> > martijn@
> > 
> > Index: charproc.c
> > ===
> > RCS file: /cvs/xenocara/app/xterm/charproc.c,v
> > retrieving revision 1.49
> > diff -u -p -r1.49 charproc.c
> > --- charproc.c  2 Apr 2021 18:44:19 -   1.49
> > +++ charproc.c  14 Apr 2021 19:24:14 -
> > @@ -2305,7 +2305,7 @@ doparsing(XtermWidget xw, unsigned c, st
> >  */
> > if (c >= 0x300
> >     && screen->wide_chars
> > -   && CharWidth(c) == 0
> > +   && CharWidth(screen, c) == 0
> >     && !isWideControl(c)) {
> >     int prev, test;
> >     Boolean used = True;
> > @@ -2330,9 +2330,9 @@ doparsing(XtermWidget xw, unsigned c, st
> > prev = (int) XTERM_CELL(use_row, use_col);
> > test = do_precomposition(prev, (int) c);
> > TRACE(("do_precomposition (U+%04X [%d], U+%04X [%d]) -> 
> > U+%04X [%d]\n",
> > -  prev, CharWidth(prev),
> > -  (int) c, CharWidth(c),
> > -  test, CharWidth(test)));
> > +  prev, CharWidth(screen, prev),
> > +  (int) c, CharWidth(screen, c),
> > +  test, CharWidth(screen, test)));
> >     } else {
> > prev = -1;
> > test = -1;
> > @@ -2342,7 +2342,7 @@ doparsing(XtermWidget xw, unsigned c, st
> >  * only if it does not change the width o

Re: wcwidth of soft hyphen

2021-04-15 Thread Martijn van Duren
I did some archeology today and found that it used to behave as
non-printable, but it got broken in release 334 (august 2018), when
CharWidth was introduced. Before that my_wcwidth was used directly.

Since there doesn't appear to be a repository with commit messages I'm
not 100% sure why this macro was introduced. My best guess at this
point would be the following line from the xterm.log.html:
several minor performance improvements using macros, e.g., inline checks
for character width.
Which would imply that it is indeed a bug in xterm.
I mailed Thomas Dickey to ask his view on the situation and maybe get
some context. Answer pending.

On Wed, 2021-04-14 at 21:25 +0200, Martijn van Duren wrote:
> On Wed, 2021-04-14 at 20:10 +0300, Lauri Tirkkonen wrote:
> > Since the discussion seems to have died out, I take my patch will not be
> > accepted.
> > 
> > The decision appears to be that OpenBSD is right and everyone else is wrong 
> > in
> > this matter. Given that, and the calls to change the behavior of other OSes 
> > and
> > terminal emulators around SHY: are you going to at least patch xterm 
> > in-tree so
> > that it does not render SHY?
> > 
> > Or must it remain broken?
> > 
> Looking closer at the xterm source corroborated my previous reasoning.
> From xterm's wcwidth.c:
> /*
>  * Provide a way to change the behavior of soft-hyphen.
>  */
> void mk_wcwidth_init(int mode)
> {
>   use_latin1 = (mode == 0);
> }
> 
> and
> 
>  *    - SOFT HYPHEN (U+00AD) has a column width of 1 in Latin-1, 0 in Unicode.
>  *  An initialization function is used to switch between the two.
> 
> So it is the intention of xterm to not display the soft hyphen in
> unicode mode.
> 
> This is also corrobarated by charproc.c5799 where the error occurs:
>     if (ch == 0xad) {    
>     /*
>   
>  * Only display soft-hyphen if it happens to be at
>  * the right-margin.  While that means that only
>  * the displayed character could be selected for
>  * pasting, a well-behaved application would never
>  * send this, anyway...
>  */
> 
> The problem here is that on line 5795 we have:
> last_chomp = CharWidth(buf[n]);
> which expands to:
> CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) (n)))
> and
> #define IsLatin1(n)  (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= 255))
> 
> So here's the big oops: CharWidth doesn't know we're in UTF-8 mode and
> we never reach my_wcwidth.
> 
> Diff below fixes this behaviour for me and restores the printing
> behaviour when I run xterm with +u8 to reset utf-8 mode.
> However, I'm no xterm hacker and it's quite a beast, so this needs
> proper testing and scrutiny from someone who knows the code to make
> sure there's no use of uninitialized variables. (CC matthieu@)
> 
> No intention of pushing this for 6.9, but maybe someone brave is
> willing to dive in here after me.
> 
> martijn@
> 
> Index: charproc.c
> ===
> RCS file: /cvs/xenocara/app/xterm/charproc.c,v
> retrieving revision 1.49
> diff -u -p -r1.49 charproc.c
> --- charproc.c  2 Apr 2021 18:44:19 -   1.49
> +++ charproc.c  14 Apr 2021 19:24:14 -
> @@ -2305,7 +2305,7 @@ doparsing(XtermWidget xw, unsigned c, st
>  */
> if (c >= 0x300
>     && screen->wide_chars
> -   && CharWidth(c) == 0
> +   && CharWidth(screen, c) == 0
>     && !isWideControl(c)) {
>     int prev, test;
>     Boolean used = True;
> @@ -2330,9 +2330,9 @@ doparsing(XtermWidget xw, unsigned c, st
> prev = (int) XTERM_CELL(use_row, use_col);
> test = do_precomposition(prev, (int) c);
> TRACE(("do_precomposition (U+%04X [%d], U+%04X [%d]) -> 
> U+%04X [%d]\n",
> -  prev, CharWidth(prev),
> -  (int) c, CharWidth(c),
> -  test, CharWidth(test)));
> +  prev, CharWidth(screen, prev),
> +  (int) c, CharWidth(screen, c),
> +  test, CharWidth(screen, test)));
>     } else {
> prev = -1;
> test = -1;
> @@ -2342,7 +2342,7 @@ doparsing(XtermWidget xw, unsigned c, st
>  * only if it does not change the width of the base character
>  */
>     if (test != -1
> -   && CharWidth(test) == CharWidth(prev)) {
> +   && CharWidth(screen, test) == CharWidth(screen, prev)) {
> putXtermCell(screen, use_row, use_col, test);
>     } else if (screen->char_was_written
>    || getXtermCell(screen, use_row, use_col

Re: wcwidth of soft hyphen

2021-04-14 Thread Lauri Tirkkonen
Since the discussion seems to have died out, I take my patch will not be
accepted.

The decision appears to be that OpenBSD is right and everyone else is wrong in
this matter. Given that, and the calls to change the behavior of other OSes and
terminal emulators around SHY: are you going to at least patch xterm in-tree so
that it does not render SHY?

Or must it remain broken?

-- 
Lauri Tirkkonen | lotheac @ IRCnet



Re: wcwidth of soft hyphen

2021-04-14 Thread Martijn van Duren
On Wed, 2021-04-14 at 20:10 +0300, Lauri Tirkkonen wrote:
> Since the discussion seems to have died out, I take my patch will not be
> accepted.
> 
> The decision appears to be that OpenBSD is right and everyone else is wrong in
> this matter. Given that, and the calls to change the behavior of other OSes 
> and
> terminal emulators around SHY: are you going to at least patch xterm in-tree 
> so
> that it does not render SHY?
> 
> Or must it remain broken?
> 
Looking closer at the xterm source corroborated my previous reasoning.
>From xterm's wcwidth.c:
/*
 * Provide a way to change the behavior of soft-hyphen.
 */
void mk_wcwidth_init(int mode)
{
  use_latin1 = (mode == 0);
}

and

 *- SOFT HYPHEN (U+00AD) has a column width of 1 in Latin-1, 0 in Unicode.
 *  An initialization function is used to switch between the two.

So it is the intention of xterm to not display the soft hyphen in
unicode mode.

This is also corrobarated by charproc.c5799 where the error occurs:
if (ch == 0xad) {
/*  

 * Only display soft-hyphen if it happens to be at
 * the right-margin.  While that means that only
 * the displayed character could be selected for
 * pasting, a well-behaved application would never
 * send this, anyway...
 */

The problem here is that on line 5795 we have:
last_chomp = CharWidth(buf[n]);
which expands to:
CharWidth(n) (((n) < 256) ? (IsLatin1(n) ? 1 : 0) : my_wcwidth((wchar_t) (n)))
and
#define IsLatin1(n)  (((n) >= 32 && (n) <= 126) || ((n) >= 160 && (n) <= 255))

So here's the big oops: CharWidth doesn't know we're in UTF-8 mode and
we never reach my_wcwidth.

Diff below fixes this behaviour for me and restores the printing
behaviour when I run xterm with +u8 to reset utf-8 mode.
However, I'm no xterm hacker and it's quite a beast, so this needs
proper testing and scrutiny from someone who knows the code to make
sure there's no use of uninitialized variables. (CC matthieu@)

No intention of pushing this for 6.9, but maybe someone brave is
willing to dive in here after me.

martijn@

Index: charproc.c
===
RCS file: /cvs/xenocara/app/xterm/charproc.c,v
retrieving revision 1.49
diff -u -p -r1.49 charproc.c
--- charproc.c  2 Apr 2021 18:44:19 -   1.49
+++ charproc.c  14 Apr 2021 19:24:14 -
@@ -2305,7 +2305,7 @@ doparsing(XtermWidget xw, unsigned c, st
 */
if (c >= 0x300
&& screen->wide_chars
-   && CharWidth(c) == 0
+   && CharWidth(screen, c) == 0
&& !isWideControl(c)) {
int prev, test;
Boolean used = True;
@@ -2330,9 +2330,9 @@ doparsing(XtermWidget xw, unsigned c, st
prev = (int) XTERM_CELL(use_row, use_col);
test = do_precomposition(prev, (int) c);
TRACE(("do_precomposition (U+%04X [%d], U+%04X [%d]) -> U+%04X 
[%d]\n",
-  prev, CharWidth(prev),
-  (int) c, CharWidth(c),
-  test, CharWidth(test)));
+  prev, CharWidth(screen, prev),
+  (int) c, CharWidth(screen, c),
+  test, CharWidth(screen, test)));
} else {
prev = -1;
test = -1;
@@ -2342,7 +2342,7 @@ doparsing(XtermWidget xw, unsigned c, st
 * only if it does not change the width of the base character
 */
if (test != -1
-   && CharWidth(test) == CharWidth(prev)) {
+   && CharWidth(screen, test) == CharWidth(screen, prev)) {
putXtermCell(screen, use_row, use_col, test);
} else if (screen->char_was_written
   || getXtermCell(screen, use_row, use_col) >= ' ') {
@@ -4551,7 +4551,7 @@ doparsing(XtermWidget xw, unsigned c, st
value = zero_if_default(0);
 
TRACE(("CASE_DECFRA - Fill rectangular area\n"));
-   if (nparam > 0 && CharWidth(value) > 0) {
+   if (nparam > 0 && CharWidth(screen, value) > 0) {
xtermParseRect(xw, ParamPair(1), &myRect);
ScrnFillRectangle(xw, &myRect, value, xw->flags, True);
}
@@ -4860,7 +4860,7 @@ doparsing(XtermWidget xw, unsigned c, st
 
case CASE_REP:
TRACE(("CASE_REP\n"));
-   if (CharWidth(sp->lastchar) > 0) {
+   if (CharWidth(screen, sp->lastchar) > 0) {
IChar repeated[2];
count = one_if_default(0);
repeated[0] = (IChar) sp->lastchar;
@@ -5792,7 +5792,7 @@ dotext(XtermWidget xw,
   buf[n] <= 0xa0) {

Re: wcwidth of soft hyphen

2021-04-06 Thread Lauri Tirkkonen
On Tue, Apr 06 2021 11:27:21 +0100, Stuart Henderson wrote:
> Some terminal emulators are using iso-8859-1 semantics of soft hyphen,
> unicode did things differently but those terminals haven't changed.
> 
> xterm printed as hyphen
> mltermprinted as hyphen
> putty printed as hyphen
> urxvt overprinted on previous character
> stnot printed, no space
> kitty not printed, no space
> cool-retro-term   not printed, no space
> sakuraprinted as space

st actually relies on wcwidth(), so on Debian (for example) it prints the SHY
as a hyphen.

> Pragmatically the simplest fix for the original problem might be if
> irssi filtered out soft-hyphen characters like mutt does in its
> "is_display_corrupting_utf8()" function:
> 
> https://gitlab.com/muttmua/mutt/-/blob/master/mbyte.c#L528

Thanks, it's news to me that mutt does that. It speaks to something when an
application is explicitly hardcoding codepoints not to print. I don't
particularly like the 'solution' of every TUI application having to ship their
own fixes for stuff like this though.

-- 
Lauri Tirkkonen | lotheac @ IRCnet



Re: wcwidth of soft hyphen

2021-04-06 Thread Lauri Tirkkonen
On Tue, Apr 06 2021 13:09:11 +0200, Martijn van Duren wrote:
> On Thu, 2021-04-01 at 10:39 +0300, Lauri Tirkkonen wrote:
> > On Thu, Apr 01 2021 09:30:36 +0200, Martijn van Duren wrote:
> > > However, based on the description by the Unicode Consortium I think
> > > OpenBSD does the right thing and xterm and others should be fixed,
> > 
> > practically, I doubt this will happen. I don't think the glibc people will 
> > be
> > convinced to break compatibility to their older versions, for example. I
> > explicitly mentioned I don't wish to engage in a discussion about which way 
> > is
> > _correct_ - I am interested in interoperability with real, existing systems.
> > 
> I´m not convinced that you´ve shown that it´s actually an
> interoperability issue. In your last mail you state that it´s a simple
> display difference between tmux and raw xterm on OpenBSD. To me that´s
> similar to most linux distro´s having grep being an alias for
> grep --color=auto by default and stating that we should do the same
> because you like pretty colours. What applications fail to operate or
> operate in an severely erroneous way because of this discrepency?

I'll try again to describe the problem, and show an example.

TUI applications often care, for layout purposes, how long a particular string
or line will be on the output device. A not insignificant number of those
applications use wcwidth() to figure out how much column space will be taken by
a certain character or string.

If the application performing the width calculations is running on a different
machine than the terminal, say, through ssh, it is important that the
application's idea of width matches what the terminal will eventually render; if
it doesn't, then the application could print the string over some other TUI
element, for example.

This is difficult and messy for many reasons already discussed, especially when
different operating systems disagree about the width or printability of a
character. Nevertheles , in 2021, wcwidth implementations mostly agree and even
things like emojis get a wcwidth of 2 everywhere I've observed (in contrast to
some -1 wcwidths of printable characters I observed on other OSes in the past).
But SHY seems to still be something that causes issues in terminals, at least
for me.

As the example, I ran the command "/exec cat longshy.txt /etc/motd" inside of
irssi, in a 80x24 terminal window, with a few different
terminal/OS-running-terminal/OS-running-irssi configurations. 'longshy.txt' is
available at https://hacktheplanet.fi/shy/longshy.txt

Let's start with st(1), since it's simple and uses wcwidth() directly to decide
how wide a character should be printed:

st on OpenBSD, local irssi https://hacktheplanet.fi/shy/st-openbsd-local.png
st on Debian, local irssi https://hacktheplanet.fi/shy/st-debian-local.png

Here we can see two key things:
 1) on Debian, st is rendering the SHY characters - on OpenBSD it is not
 2) on Debian, irssi considers the line long enough that it splits it and prints
the remainder on the next line, indented

So, let's introduce ssh into the mix:

st on OpenBSD, irssi on Debian 
https://hacktheplanet.fi/shy/st-openbsd-ssh-debian.png
st on Debian, irssi on OpenBSD 
https://hacktheplanet.fi/shy/st-debian-ssh-openbsd.png

We begin to see differences that stem from wcwidth(SHY). These problems aren't
very big, since in both cases the output is still readable and no information is
lost.

Now, let's try xterm(1). It has been observed in this thread that xterm always
prints SHY.

xterm on OpenBSD, local irssi 
https://hacktheplanet.fi/shy/xterm-openbsd-local.png
xterm on Debian, local irssi https://hacktheplanet.fi/shy/xterm-debian-local.png

On OpenBSD, irssi thinks that the entire line fits into the 80 columns
available. But because xterm prints SHYs, the line overflows onto the next and
is promptly overwritten by the next line that irssi puts there (the motd).

And finally ssh with xterm:

xterm on OpenBSD, irssi on Debian 
https://hacktheplanet.fi/shy/xterm-openbsd-ssh-debian.png
xterm on Debian, irssi on OpenBSD 
https://hacktheplanet.fi/shy/xterm-debian-ssh-openbsd.png

This isn't the best example: there are many different problems that can arise
from the width calculation discrepancy - some of them can be more spectacular I
think, but I could only come up with this one on demand.

Despite the bad example, I do consider cases where text messes up in ways the
application did not intend (in the worst case, overwriting other text) on the
same terminal on different operating systems interoperability bugs. In this case
the outputs are different due to interactions between systems that use
wcwidth(SHY) = 1 (such as, apparently, xterm even locally) and OpenBSD. I
might not say it is "operating in a severely erroneous way", but then I don't
consider "severely erroneous" as a requirement to fix issues.

> If you want to show a hyphen in your text, use a hyphen. If you want to
> indicate where a word might be broke

Re: wcwidth of soft hyphen

2021-04-06 Thread Martijn van Duren
On Tue, 2021-04-06 at 13:27 +0100, Stuart Henderson wrote:
> On 2021/04/06 13:09, Martijn van Duren wrote:
> > I´m also not convinced that the other wcwidth implementations might be
> > on to something and that the unicode consortium is having inertia
> > problems.
> 
> The difficulty is that it isn't *possible* to give a single correct
> answer for the width of SHY, it varies and can only be identified
> when other information about the terminal is taken into account (how
> the terminal behaves and whether the word currently printed is being
> wrapped), which is out of scope for wcwidth(3). So no surprise
> different people come up with a different way to handle it.

My statement is that we have xterm in UTF-8 mode and we only support
ASCII/UTF-8 in base. So we should use the unicode definitions. They
state that a SHY should only be replaced by a hyphen on the end of the
line and taking localized grammar rules into account.
Since the shell never looks at ZWSP/SHY/whatever character for breaking
up a word over multiple lines it should *never* be visible on the shell
making our definition of 0 width always correct. If an application uses
it to break a word over two lines it needs to take the local grammar
into account, potentially changing the surrounding characters. In that
case the application only uses it as an indicator of the hyphenated
breakup and should place an actual hyphen there itself, making the SHY
still only an invisible indicator with width 0.
> 
> > If you want to show a hyphen in your text, use a hyphen. If you want to
> > indicate where a word might be broken up in a hyphenated way across two
> > lines if the software knows the localized grammar rules use a SHY.
> > Also thanks to sthen@ for pointing out where the confusion comes from:
> > we´re using UTF-8 here, not ISO-8859-1, so we must make sure that we
> > use the UTF-8 definitions.
> 
> but, guess what happens when text is converted from ISO-8859-1 to UTF-8...
> 
> $ printf '\xad' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
>   c2 ad |..|
> 
If ISO-8859-1 SHY has no 1-on-1 counterpart in unicode I´d probably
choose the same conversion. That doesn´t make them equal, just a
close enough aproximation for automated tooling.



Re: wcwidth of soft hyphen

2021-04-06 Thread Stuart Henderson
On 2021/04/06 13:09, Martijn van Duren wrote:
> I´m also not convinced that the other wcwidth implementations might be
> on to something and that the unicode consortium is having inertia
> problems.

The difficulty is that it isn't *possible* to give a single correct
answer for the width of SHY, it varies and can only be identified
when other information about the terminal is taken into account (how
the terminal behaves and whether the word currently printed is being
wrapped), which is out of scope for wcwidth(3). So no surprise
different people come up with a different way to handle it.

> If you want to show a hyphen in your text, use a hyphen. If you want to
> indicate where a word might be broken up in a hyphenated way across two
> lines if the software knows the localized grammar rules use a SHY.
> Also thanks to sthen@ for pointing out where the confusion comes from:
> we´re using UTF-8 here, not ISO-8859-1, so we must make sure that we
> use the UTF-8 definitions.

but, guess what happens when text is converted from ISO-8859-1 to UTF-8...

$ printf '\xad' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
  c2 ad |..|



Re: wcwidth of soft hyphen

2021-04-06 Thread Martijn van Duren
On Mon, 2021-04-05 at 20:30 +0200, Ingo Schwarze wrote:
> Hi,
> 
> Martijn van Duren wrote on Thu, Apr 01, 2021 at 09:30:36AM +0200:
> > So going by this phrase the character should not be printed
> 
> When formatting a document, for example for printing on paper or
> the online equivalent like PostScript or PDF, i agree.  But i
> strongly prefer the terminal to always display this character because
> the terminal's usual purpose is not nice text formatting for visual
> consumption.  It should usually show the full content of strings
> or files, be it for inspection or for editing.  Omitting characters
> in such contexts sets nasty traps for the person working with the
> terminal.
> 
> So i say nothing should be changed at all in OpenBSD.
> 
> Yes, that means column counting is wrong on the terminal, but that's
> a very minor problem, if it's a problem at all, compared to the havoc
> that could result from not showing the character on the terminal at
> all, and it cannot be fixed without causing worse problems in situations
> that matter more.

I disagree with you here. As sthen@ just pointed out this is most likely
a legacy print from ISO-8559-1 which uses a different definition of SHY.
Saying that not showing a character on the terminal at all can cause
havoc also have different implications: we would have to start printing
ZWSP and have to make a stronger distinction between tab and space.
And that´s just a few examples top of the head.

If you want to see the actual text you´re working with you need
something like vis(1), hexdump(1), or something more sophisticated for
UTF-8.

We claim we support UTF-8, so we should use the unicode consortium
definitions. Especially if they make linguistic sense; which it does.
> 
> The bug in NetBSD and Linux should be fixed, but that's off-topic here.

And I´d like to add terminals in unicode mode to that list.
> 
> Yours,
>   Ingo

martijn@



Re: wcwidth of soft hyphen

2021-04-06 Thread Martijn van Duren
On Thu, 2021-04-01 at 10:39 +0300, Lauri Tirkkonen wrote:
> On Thu, Apr 01 2021 09:30:36 +0200, Martijn van Duren wrote:
> > However, based on the description by the Unicode Consortium I think
> > OpenBSD does the right thing and xterm and others should be fixed,
> 
> practically, I doubt this will happen. I don't think the glibc people will be
> convinced to break compatibility to their older versions, for example. I
> explicitly mentioned I don't wish to engage in a discussion about which way is
> _correct_ - I am interested in interoperability with real, existing systems.
> 
I´m not convinced that you´ve shown that it´s actually an
interoperability issue. In your last mail you state that it´s a simple
display difference between tmux and raw xterm on OpenBSD. To me that´s
similar to most linux distro´s having grep being an alias for
grep --color=auto by default and stating that we should do the same
because you like pretty colours. What applications fail to operate or
operate in an severely erroneous way because of this discrepency?

I´m also not convinced that the other wcwidth implementations might be
on to something and that the unicode consortium is having inertia
problems. In my previous mail I quoted on what linguistic constructs
the character is based and that it is invisible. To stick with their
example: I write "opaatje" or "opa-" LF "tje", not "opaa-tje".

If you want to show a hyphen in your text, use a hyphen. If you want to
indicate where a word might be broken up in a hyphenated way across two
lines if the software knows the localized grammar rules use a SHY.
Also thanks to sthen@ for pointing out where the confusion comes from:
we´re using UTF-8 here, not ISO-8859-1, so we must make sure that we
use the UTF-8 definitions.

martijn@



Re: wcwidth of soft hyphen

2021-04-06 Thread Stuart Henderson
On 2021/04/05 12:45, Theo de Raadt wrote:
> So, your argument is that displays should remain broken forever.
> 
> > The bug in NetBSD and Linux should be fixed, but that's off-topic here.
> 
> If you cannot explain how this problem is going to be fixed (reversed)
> in these opposing ecosystems (and it is not just Linux and NetBSD), then
> you've closed the argument with a cop-out.
> 
> It cannot be off-topic.
> 
> It is an interop problem which must be settled.
> 
> From time to time, defacto standards arise which have inertia that
> is too great to fight.
> 
> Your position seems to me that original standard are etched in stone and
> it is impossible to have new defacto standards arise, and if interop
> issues arrive, screw everyone -- can't they see there is a stone?

Some terminal emulators are using iso-8859-1 semantics of soft hyphen,
unicode did things differently but those terminals haven't changed.

xterm   printed as hyphen
mlterm  printed as hyphen
putty   printed as hyphen
urxvt   overprinted on previous character
st  not printed, no space
kitty   not printed, no space
cool-retro-term not printed, no space
sakura  printed as space

There's some more about this on https://jkorpela.fi/shy.html,
it's all a mess.

Pragmatically the simplest fix for the original problem might be if
irssi filtered out soft-hyphen characters like mutt does in its
"is_display_corrupting_utf8()" function:

https://gitlab.com/muttmua/mutt/-/blob/master/mbyte.c#L528



Re: wcwidth of soft hyphen

2021-04-05 Thread Lauri Tirkkonen
Hi Ingo,

On Mon, Apr 05 2021 20:30:39 +0200, Ingo Schwarze wrote:
> Whether all control chars are always width 0 can maybe also be
> disputed.  Again, the stronger argument seems to me that they are.
> If they weren't, they would not be control characters but alphanumeric,
> punctuation, spaces, or special printable characters, none of which
> they are.  I say width 1 and 2 require standalone glyphs that are
> normally used for the character.  Besides, no operating system
> correctly identifies this as a control character and yet gives it
> width 1.

I agree with your assessments about iswcntrl and iswprint. My original patch
proposed to keep those return values as is, and only change the wcwidth() from 0
to 1.

> I insist that the discussion should remain very strictly formal,
> about the properties and classification in the Unicode data files
> and nothing else.  If people start arguing about what makes sense
> for any particular character, that's already an argument going
> astray.

In general I agree, but I contend that SHY is, unfortunately, a little bit
special. This confusion about its printability and/or column width is definitely
not unique to OpenBSD.

> > So going by this phrase the character should not be printed
> 
> When formatting a document, for example for printing on paper or
> the online equivalent like PostScript or PDF, i agree.  But i
> strongly prefer the terminal to always display this character because
> the terminal's usual purpose is not nice text formatting for visual
> consumption.  It should usually show the full content of strings
> or files, be it for inspection or for editing.  Omitting characters
> in such contexts sets nasty traps for the person working with the
> terminal.

I agree with this completely - you said it better than I could have. This is
another reason why I think it makes sense for this character to have wcwidth()
of 1 - applications that are "SHY-aware" can print (or not) the soft hyphen
however they wish, but terminal software seems to almost always ask wcwidth() to
figure out the column width. Indeed, terminal software is where I ran into the
problem of SHY sometimes being invisible.

> So i say nothing should be changed at all in OpenBSD.
> 
> Yes, that means column counting is wrong on the terminal, but that's
> a very minor problem, if it's a problem at all, compared to the havoc
> that could result from not showing the character on the terminal at
> all, and it cannot be fixed without causing worse problems in situations
> that matter more.

Right, I am not at all advocating to hide the SHY on the terminal - quite the
contrary, I want to make its width consistent.

The current situation, with SHY having a wcwidth() of 0 causes, for example, the
following discrepancy between xterm(1) (on the left) and tmux(1) in xterm(1) (on
the right): https://hacktheplanet.fi/shytmux.png -- in tmux, the SHY is not
visible.

The other issues I observed with discrepancies between OpenBSD and other systems
I already outlined in my initial mail.

-- 
Lauri Tirkkonen | lotheac @ IRCnet



Re: wcwidth of soft hyphen

2021-04-05 Thread Theo de Raadt
So, your argument is that displays should remain broken forever.

> The bug in NetBSD and Linux should be fixed, but that's off-topic here.

If you cannot explain how this problem is going to be fixed (reversed)
in these opposing ecosystems (and it is not just Linux and NetBSD), then
you've closed the argument with a cop-out.

It cannot be off-topic.

It is an interop problem which must be settled.

>From time to time, defacto standards arise which have inertia that
is too great to fight.

Your position seems to me that original standard are etched in stone and
it is impossible to have new defacto standards arise, and if interop
issues arrive, screw everyone -- can't they see there is a stone?

The position statement translates simply to: It must remain broken.

Ingo Schwarze  wrote:

> Hi,
> 
> Martijn van Duren wrote on Thu, Apr 01, 2021 at 09:30:36AM +0200:
> 
> > When it comes to these discussions I prefer to go back to the standards
> 
> I would propose an even more rigorous stance: not only go back to
> the standards, but use whatever the Unicode data files (indirectly,
> via the Perl modules) parsed by gen_ctype_utf8.pl specify.  Manually
> changing properties of individual characters should be restricted
> to very rare cases of crystal clear, absolutely unambiguous errors.
> When there is the slightest doubt or when there are arguments both
> ways, follow the Unicode data files and how Perl interprets them.
> 
> We have
> 
>   iswcntrl = 1  because UnicodeData.txt has class Cf (format control char)
>   iswprint = 1  because the class is neither Cc nor Cs
>   wcwidth  = 0  because the class starts with C (control char)
> 
> This is also neither obviously nor unambiguously wrong, so it should
> not be changed.
> 
> The choice of iswcntrl = 1 is most definitely correct because
> that's what class Cf says, there can be no doubt about that at all.
> Consequently, NetBSD, glibc, and musl are definitely buggy in so far
> as they return iswcntrl = 0.
> 
> Whether class Cf is always printable is maybe not absolutely clear.
> There are arguments both ways.  The stronger argument seems to be
> that these format control chars usually appear in the middle of
> printable characters and they are printed together with the
> surrounding characters.  But maybe the FreeBSD argument that
> some of them are sometimes not ptinted and hence iswprint = 0
> can also be made, though somewhat dubiously because sometimes
> they are printed.  Besides, which property would you use for
> deciding printability?  Please, don't resort to deciding that
> character-by-character.
> 
> Whether all control chars are always width 0 can maybe also be
> disputed.  Again, the stronger argument seems to me that they are.
> If they weren't, they would not be control characters but alphanumeric,
> punctuation, spaces, or special printable characters, none of which
> they are.  I say width 1 and 2 require standalone glyphs that are
> normally used for the character.  Besides, no operating system
> correctly identifies this as a control character and yet gives it
> width 1.
> 
> I insist that the discussion should remain very strictly formal,
> about the properties and classification in the Unicode data files
> and nothing else.  If people start arguing about what makes sense
> for any particular character, that's already an argument going
> astray.
> 
> 
> > So going by this phrase the character should not be printed
> 
> When formatting a document, for example for printing on paper or
> the online equivalent like PostScript or PDF, i agree.  But i
> strongly prefer the terminal to always display this character because
> the terminal's usual purpose is not nice text formatting for visual
> consumption.  It should usually show the full content of strings
> or files, be it for inspection or for editing.  Omitting characters
> in such contexts sets nasty traps for the person working with the
> terminal.
> 
> So i say nothing should be changed at all in OpenBSD.
> 
> Yes, that means column counting is wrong on the terminal, but that's
> a very minor problem, if it's a problem at all, compared to the havoc
> that could result from not showing the character on the terminal at
> all, and it cannot be fixed without causing worse problems in situations
> that matter more.
> 
> The bug in NetBSD and Linux should be fixed, but that's off-topic here.
> 
> Yours,
>   Ingo
> 



Re: wcwidth of soft hyphen

2021-04-05 Thread Ingo Schwarze
Hi,

Martijn van Duren wrote on Thu, Apr 01, 2021 at 09:30:36AM +0200:

> When it comes to these discussions I prefer to go back to the standards

I would propose an even more rigorous stance: not only go back to
the standards, but use whatever the Unicode data files (indirectly,
via the Perl modules) parsed by gen_ctype_utf8.pl specify.  Manually
changing properties of individual characters should be restricted
to very rare cases of crystal clear, absolutely unambiguous errors.
When there is the slightest doubt or when there are arguments both
ways, follow the Unicode data files and how Perl interprets them.

We have

  iswcntrl = 1  because UnicodeData.txt has class Cf (format control char)
  iswprint = 1  because the class is neither Cc nor Cs
  wcwidth  = 0  because the class starts with C (control char)

This is also neither obviously nor unambiguously wrong, so it should
not be changed.

The choice of iswcntrl = 1 is most definitely correct because
that's what class Cf says, there can be no doubt about that at all.
Consequently, NetBSD, glibc, and musl are definitely buggy in so far
as they return iswcntrl = 0.

Whether class Cf is always printable is maybe not absolutely clear.
There are arguments both ways.  The stronger argument seems to be
that these format control chars usually appear in the middle of
printable characters and they are printed together with the
surrounding characters.  But maybe the FreeBSD argument that
some of them are sometimes not ptinted and hence iswprint = 0
can also be made, though somewhat dubiously because sometimes
they are printed.  Besides, which property would you use for
deciding printability?  Please, don't resort to deciding that
character-by-character.

Whether all control chars are always width 0 can maybe also be
disputed.  Again, the stronger argument seems to me that they are.
If they weren't, they would not be control characters but alphanumeric,
punctuation, spaces, or special printable characters, none of which
they are.  I say width 1 and 2 require standalone glyphs that are
normally used for the character.  Besides, no operating system
correctly identifies this as a control character and yet gives it
width 1.

I insist that the discussion should remain very strictly formal,
about the properties and classification in the Unicode data files
and nothing else.  If people start arguing about what makes sense
for any particular character, that's already an argument going
astray.


> So going by this phrase the character should not be printed

When formatting a document, for example for printing on paper or
the online equivalent like PostScript or PDF, i agree.  But i
strongly prefer the terminal to always display this character because
the terminal's usual purpose is not nice text formatting for visual
consumption.  It should usually show the full content of strings
or files, be it for inspection or for editing.  Omitting characters
in such contexts sets nasty traps for the person working with the
terminal.

So i say nothing should be changed at all in OpenBSD.

Yes, that means column counting is wrong on the terminal, but that's
a very minor problem, if it's a problem at all, compared to the havoc
that could result from not showing the character on the terminal at
all, and it cannot be fixed without causing worse problems in situations
that matter more.

The bug in NetBSD and Linux should be fixed, but that's off-topic here.

Yours,
  Ingo



Re: wcwidth of soft hyphen

2021-04-01 Thread Lauri Tirkkonen
On Thu, Apr 01 2021 09:30:36 +0200, Martijn van Duren wrote:
> However, based on the description by the Unicode Consortium I think
> OpenBSD does the right thing and xterm and others should be fixed,

practically, I doubt this will happen. I don't think the glibc people will be
convinced to break compatibility to their older versions, for example. I
explicitly mentioned I don't wish to engage in a discussion about which way is
_correct_ - I am interested in interoperability with real, existing systems.

-- 
Lauri Tirkkonen | lotheac @ IRCnet



Re: wcwidth of soft hyphen

2021-04-01 Thread Martijn van Duren
When it comes to these discussions I prefer to go back to the standards
and not just looking at the surrounding discussions.
The standard[0] states the following in section 23.2:
Hyphenation. U+00AD soft hyphen (SHY ) indicates an intraword break
point, where aline break is preferred if a word must be hyphenated or
otherwise broken across lines. Suchbreak points are generally determined
by an automatic hyphenator. SHY can be used withany  script,  but  its
use  is  generally  limited  to  situations  where  users  need  to
override  thebehavior of such a hyphenator. The visible rendering of a  
line break at an intraword breakpoint, whether automatically determined
or indicated by a SHY, depends on the surrounding characters, the rules
governing the script and language used, and, at times, the meaningof the
word. The precise rules are outside the scope of this standard, but see
Unicode Stan-dard Annex #14, "Unicode Line Breaking Algorithm," for 
additional information. A com-mon default rendering is to insert a 
hyphen before the line break, but this is insufficient or even incorrect 
in many situations

Where Annex #14 section 5.4[1] states begins with:
Unlike U+2010 HYPHEN, which always has a visible rendition, the
character U+00AD SOFT HYPHEN (SHY) is an invisible format character that
merely indicates a preferred intraword line break position
...
Depending on the language and the word, that may produce different visible
results[2]

So going by this phrase the character should not be printed and have no
incluence on the text if it´s not used as a linebreak. The problem arises
on how the terminal handles this character. In the case of xterm it
appears to always print the character (printf "\302\255"), which according
to Annex #14 is wrong. If you were to use another terminal which honours
the this guideline OpenBSD would be correct and glibc etc is wrong.

There´s also something to say for the way FreeBSD handles it, but that
would break things even more on some OpenBSD applications, like ls(1),
where a wcwidth of -1 would print a ´?´, which is even worse. Maybe
this should be revisited and just skip these characters completely, but
that´s outside the scope of this discussion.

In conclusion: As long as the output device isn´t the database used to
determine how things are displayed there´s no 100% guarantee that the
software calculating the column width is doing the right thing.
However, based on the description by the Unicode Consortium I think
OpenBSD does the right thing and xterm and others should be fixed,
especially if they just do a dumb printing of the characters without
taking the proper line breaking rules into account and just keep on
printing until the end of the screen and then continue on the next line.
This goes double if the printing of the hyphen must cause visible
changes (like spelling) according to the language rules.

martijn@

On Thu, 2021-04-01 at 08:27 +0300, Lauri Tirkkonen wrote:
> When using terminal software on non-OpenBSD to connect to my OpenBSD IRC
> machine, I noticed that sometimes the local terminal disagrees with the remote
> tmux and application (in this case, irssi) about the character width of some
> lines, causing different kinds of breakage. Those lines happened to contain 
> soft
> hyphens (U+00AD), which behave as follows across a few different operating
> systems:
> 
> OpenBSD-CURRENT:iswprint(SHY) = 1 iswcntrl(SHY) = 1 wcwidth(SHY) = 0
> NetBSD 9.1: iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
> FreeBSD 12.2:   iswprint(SHY) = 0 iswcntrl(SHY) = 1 wcwidth(SHY) = -1
> glibc (Debian sid): iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
> musl (Alpine 3.13.3):   iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
> 
> On Windows, PowerShell, PuTTY and MinTTY (shipped with the default install of
> git from git-scm.com as part of MSYS2) render the soft hyphen as a visible
> character with a width of 1 column.
> 
> The OpenBSD wcwidth(SHY) of 0 is what the problem comes down to (FreeBSD's
> return values are also strange, but this is an OpenBSD list). There is a lot 
> of
> background discussion about whether or not Unicode intends the SHY to be
> printable or not, and whether it should have width of 0 or 1, in eg. [0] and
> [1], but for better or worse, it seems most other systems decided that SHY 
> has a
> width of 1 and should be a visible character (at least in terminal contexts).
> 
> Therefore, in the interest of interoperability, I propose the following diff 
> to
> special-case SHY into having a width of 1. I don't intend to go down the 
> rabbit
> hole of a discussion regarding what the 'correct' width is, but the 
> discrepancy
> with other systems causes real problems, and I think those other systems made
> their decisions years ago (see eg. [0] for glibc).
> 
> Diff below only for gen_ctype_utf8.pl; I am not including the resulting
> en_US.UTF-8.src diff, because it seems there is a Unicode 12.1.0 to 13.0.0
>