On 04/14/2013 02:10 AM, Christoph Lohmann wrote:
Greetings.

On Sun, 14 Apr 2013 08:10:22 +0200 Random832 <random...@fastmail.us> wrote:
I am forced to ask, though, why character cell values are stored in
utf-8 rather than as wchar_t (or as an explicitly unicode int) in the
first place, particularly since the simplest way to detect a wide
character is to call the function wcwidth. What was the reason for this
design decision? It doesn't save any space, since on most systems
UTF_SIZ == sizeof(int) == sizeof(wchar_t).
That design decision can change when I’m actually implementing the
double-width and double-height support in st. The codebase is small
enough to change such a type in less than 10 minutes. So no religion
was introduced here.

I asked about using codepoints instead of UTF-8 because I thought it might make it easier to support combining diacritics, not wide characters. The two problems are broadly related because both of them affect the number of character cells occupied by a string.
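
To make the relationship between the two problems concrete, here is a minimal C sketch that decodes UTF-8 into codepoints and sums the cell widths. The names utf8_decode, cp_width and str_cells are made up for this illustration and are not st code. The width ranges are a small hand-picked subset chosen only to keep the example self-contained; a real implementation would presumably call wcwidth or carry a full table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence at s into *cp and return
 * the number of bytes consumed. Invalid or truncated input yields U+FFFD
 * and consumes one byte. No overlong or surrogate checks are done. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else { *cp = 0xFFFD; return 1; }

    for (i = 1; i < len; i++) {
        /* A NUL terminator is not a continuation byte, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical width lookup: 0 for combining marks, 2 for wide characters,
 * 1 otherwise. Only a few illustrative ranges are listed. */
static int cp_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)       /* combining diacritical marks */
        return 0;
    if ((cp >= 0x3041 && cp <= 0x3096) ||   /* Hiragana */
        (cp >= 0x4E00 && cp <= 0x9FFF) ||   /* CJK unified ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))     /* fullwidth forms */
        return 2;
    return 1;
}

/* Number of character cells a NUL-terminated UTF-8 string occupies. */
static int str_cells(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int cells = 0;

    while (*p) {
        p += utf8_decode(p, &cp);
        cells += cp_width(cp);
    }
    return cells;
}
```

Under these assumptions, "e" followed by U+0301 is three bytes and two codepoints but one cell, while U+65E5 is three bytes and one codepoint but two cells. Neither the byte count nor the codepoint count gives the cell count.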

And I don't know the st codebase well enough (or at all, really) to tell
at a glance what would have to be changed to be able to support a
double-width character cell, or to support wrapping to the next line if
one is output at the second-to-last column.
I haven't yet had the time to read all the double-width implementations in
other terminals, so that st would do the »right thing« in all the
questionable cases.

Double-width characters are, like BCE, a design decision applications
need to adapt to.

Some corner cases I haven't yet found a good answer to:
        * Is there any standard for this except for setting the flag in
          terminfo and taking up two cells in the terminal?

I don't know if there's a standard. I can find nothing about character cell terminals in any UTR, and ECMA 48 is silent on the question of wide characters.

I don't know what terminfo flag you are referring to. I was talking about support for East Asian characters, not VT100-style stretching of ASCII characters. I suspect the widcs/swidm/rwidm capabilities refer to the latter (though the only actual instance in the terminfo database is a swidm string on the att730).


Observed behavior in various terminals that do support them is:
* cursor position can be in either half of a double character, though the whole character is highlighted (all observed terminals)
* outputting one at the end of the line (i.e. where a pair of two narrow characters would be split across lines) fails entirely (xterm) or wraps to the next line leaving the last cell alone (vte, tmux, mlterm, kterm).
* outputting a narrow character on top of a wide character erases the entire wide character (xterm, tmux, mlterm, kterm) or erases only when in the left half (vte)

* deleting (e.g. with ESC [ P) part of a character has various different behaviors:
** on xterm and kterm, deleting either half of a character replaces the remaining half with a single-width blank space.
** tmux's behavior is very buggy: a vertical line drawn across a different part of the screen _after_ deleting different parts of wide characters on different lines ended up redrawing incorrectly after refreshing. As for the wide characters themselves, deleting the left half deletes the entire character and deleting the right half has no effect, but there is some hidden state involved - a sequence of two deletions will delete a single wide character. I suspect the "right half" is filled with some placeholder value that is not output to the host terminal, and they are deleted individually. This is consistent with all of my observations.
** on mlterm, deleting the left half of a character deletes the entire character; deleting the right half replaces it with two spaces.
** on vte, deleting the right half of a character replaces the _next_ character with a space. Deleting the left half replaces the present character with a space, but seems to leave some hidden state, since the cursor on this "space" is still double width.
* the xterm/kterm behavior seems the most rational, since it yields no visual glitches, always keeps the cursor in the same logical position, and a deletion always shifts characters right of it by the same amount.
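
The xterm/kterm deletion behavior described above can be modelled in a few lines of C. This is only a sketch of the observed behavior, not code taken from xterm, kterm or st. The Cell type, the CELL_* kinds and delete_cell are hypothetical names, and the model assumes a wide character is stored as a left-half cell followed by a placeholder right-half cell.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cell kinds: a wide character occupies a WIDE_LEFT cell
 * followed by a WIDE_RIGHT placeholder cell. */
enum { CELL_NARROW, CELL_WIDE_LEFT, CELL_WIDE_RIGHT };

typedef struct {
    uint32_t cp;   /* codepoint; unused for CELL_WIDE_RIGHT */
    int kind;
} Cell;

static const Cell blank = { ' ', CELL_NARROW };

/* Delete the single cell at column x in a row of ncols cells.
 * If x is one half of a wide character, the other half is first turned
 * into a narrow blank; then everything right of x shifts left by exactly
 * one cell and the last column is filled with a blank. */
static void delete_cell(Cell *row, size_t ncols, size_t x)
{
    size_t i;

    if (x >= ncols)
        return;

    /* Blank the half that would otherwise be orphaned. */
    if (row[x].kind == CELL_WIDE_LEFT && x + 1 < ncols)
        row[x + 1] = blank;
    else if (row[x].kind == CELL_WIDE_RIGHT && x > 0)
        row[x - 1] = blank;

    /* Shift the rest of the row left by one cell. */
    for (i = x; i + 1 < ncols; i++)
        row[i] = row[i + 1];
    row[ncols - 1] = blank;
}
```

In this model, deleting either half of a wide character in the row "a", wide, "b", "c" gives the same result: "a", blank, "b", "c", blank. The shift is always one cell, which matches the observation that the cursor stays in the same logical position.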

I haven't made any detailed investigation into the actual set of characters that are considered wide (or combining) by each terminal and by various applications (except tmux, which has a list of ranges in utf8.c). I also haven't investigated whether any of them have locale-dependent treatment of "ambiguous" characters (e.g. Greek or Cyrillic) which are wide in historical East Asian fonts (except tmux, which does not).

mlterm does have an option that makes it work differently; the above results are with -Z enabled.

        * If st has double-width default.
                * What happens if the application does naive character
                  counting? Will layouts break?

My experience is that layouts break now. I'm not sure if I can think of an application that would break that wouldn't break already due to UTF-8 support (counting bytes).

                * Is there some way to tell the application that we have
                  double-width support enforced except for the terminfo?
I would argue that an application that doesn't expect wide character support shouldn't be outputting CJK characters.
                * How do applications implement this? Is there some historical
                  cruft that will break?
I can't speak for every application ever, but I did observe that zsh breaks when characters in the prompt that should be wide are treated as narrow. I haven't ever heard of anything breaking in ways related to this (as opposed to e.g. by byte counting on UTF-8) with the behavior currently implemented in other terminal emulators.
        * With an option to toggle the double-width handling:
                * Is this needed for tmux, screen or other terminal proxies
                  that for example miss BCE too?

tmux does support it. I don't know about screen.

mlterm has such an option. With mlterm's implementation, characters are still visually wide, which leads to some visual glitches and surprising cursor movement behavior.
These are the questions I am missing an answer to before implementing this.
The code isn’t a problem.


Sincerely,

Christoph Lohmann



