Re: [dev] [st] wide characters

Random832 Sun, 14 Apr 2013 07:56:43 -0700

On 04/14/2013 02:10 AM, Christoph Lohmann wrote:

Greetings.


On Sun, 14 Apr 2013 08:10:22 +0200 Random832 <random...@fastmail.us> wrote:

I am forced to ask, though, why character cell values are stored in
utf-8 rather than as wchar_t (or as an explicitly unicode int) in the
first place, particularly since the simplest way to detect a wide
character is to call the function wcwidth. What was the reason for this
design decision? It doesn't save any space, since on most systems
UTF_SIZ == sizeof(int) == sizeof(wchar_t).

That design decision can change when I’m actually implementing the
double‐width and double‐height support in st. The codebase is small enough
to change such a type in less than 10 minutes. So no religion was
introduced here.

The reason for my question about using codepoints instead of UTF-8 was
because I thought it might make it easier to support combining
diacritics, not wide characters. The two problems are broadly related
because both of them affect the number of character cells occupied by a
string.
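To make the connection between the two problems concrete, here is a minimal sketch in C. It is not st code: the names cp_width, utf8_decode and str_cells are hypothetical, the decoder assumes well-formed UTF-8, and the width ranges are a small illustrative subset rather than a complete table. It only shows that once text is handled as code points, combining marks (0 cells) and wide characters (2 cells) fall out of the same lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical width lookup: 0 = combining, 2 = wide, 1 = everything else.
 * Only a few illustrative ranges are covered; a real table is much larger. */
static int cp_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)        /* Combining Diacritical Marks */
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) ||    /* Hangul Jamo initial consonants */
        (cp >= 0x4E00 && cp <= 0x9FFF) ||    /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||    /* Hangul Syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))      /* Fullwidth Forms */
        return 2;
    return 1;
}

/* Decode one UTF-8 sequence starting at s into *cp and return the number
 * of bytes consumed. Assumption: the input is well-formed, so there is
 * no validation of continuation bytes or overlong forms. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Number of character cells a UTF-8 string would occupy under cp_width(). */
static int str_cells(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int cells = 0;

    assert(s != NULL);
    while (*p) {
        uint32_t cp;
        p += utf8_decode(p, &cp);
        cells += cp_width(cp);
    }
    return cells;
}
```

With this sketch, "e" followed by U+0301 occupies one cell, while the two CJK ideographs U+65E5 U+672C occupy four.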

And I don't know the st codebase well enough (or at all, really) to tell
at a glance what would have to be changed to be able to support a
double-width character cell, or to support wrapping to the next line if
one is output at the second-to-last column.

I hadn't yet the time to read all the double-width implementations in other
terminals so st would do the »right thing« in implementing all questionable
cases.

Double‐width characters are, like BCE, a design decision applications need
to adapt to.

Some corner cases I haven't yet found a good answer to:
        * Is there any standard for this except for setting the flag in
          terminfo and taking up two cells in the terminal?

I don't know if there's a standard. I can find nothing about character
cell terminals in any UTR, and ECMA 48 is silent on the question of wide
characters.

I don't know what terminfo flag you are referring to. I was talking
about support for east asian characters, not VT100-style stretching of
ASCII characters. I suspect the widcs/swidm/rwidm capabilities refer to
the latter (though the only actual instance in the terminfo database is
a swidm string on the att730).



Observed behavior in various terminals that do support them is:

* cursor position can be in either half of a double character, though
the whole character is highlighted (all observed terminals)
* outputting one at the end of the line (i.e. where a pair of two narrow
characters would be split across lines) fails entirely (xterm) or wraps
to the next line leaving the last cell alone (vte, tmux, mlterm, kterm).
* outputting a narrow character on top of a wide character erases the
entire wide character (xterm, tmux, mlterm, kterm) or erases only when
in the left half (vte)

* deleting (e.g. with ESC [ P) part of a character has various different
behaviors:
** on xterm and kterm, deleting either half of a character replaces the
remaining half with a single-width blank space.
** tmux's behavior is very buggy: a vertical line drawn across a
different part of the screen _after_ deleting different parts of wide
characters on different lines ended up redrawing incorrectly after
refreshing. As for the wide characters themselves, deleting the left
half deletes the entire character and deleting the right half has no
effect, but there is some hidden state involved - a sequence of two
deletions will delete a single wide character. I suspect the "right
half" is filled with some placeholder value that is not output to the
host terminal, and they are deleted individually. This is consistent
with all of my observations.
** on mlterm, deleting the left half of a character deletes the entire
character; deleting the right half replaces it with two spaces.
** on vte, deleting the right half of a character replaces the _next_
character with a space. Deleting the left half replaces the present
character with a space, but seems to leave some hidden state, since the
cursor on this "space" is still double width.
* the xterm/kterm behavior seems the most rational, since it yields no
visual glitches, always keeps the cursor in the same logical position,
and a deletion always shifts characters right of it by the same amount.
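The xterm/kterm-style behavior described above can be sketched roughly as follows. This is a toy model, not code from xterm, kterm or st: the Cell layout, the CELL_* kinds, the fixed COLS and all function names are assumptions made for illustration. A wide character is stored as a left cell plus a placeholder right cell, and any operation that destroys one half blanks the other.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COLS 8                       /* hypothetical row width */

enum { CELL_NARROW, CELL_WIDE_LEFT, CELL_WIDE_RIGHT };

typedef struct {
    uint32_t cp;                     /* code point; unused for CELL_WIDE_RIGHT */
    int kind;
} Cell;

static void blank(Cell *c)
{
    c->cp = ' ';
    c->kind = CELL_NARROW;
}

static void row_clear(Cell *row)
{
    for (int x = 0; x < COLS; x++)
        blank(&row[x]);
}

/* If cell x is one half of a wide character, blank the other half so no
 * orphaned half remains once cell x is overwritten or removed. */
static void orphan_fix(Cell *row, int x)
{
    if (row[x].kind == CELL_WIDE_LEFT && x + 1 < COLS)
        blank(&row[x + 1]);
    else if (row[x].kind == CELL_WIDE_RIGHT && x > 0)
        blank(&row[x - 1]);
}

/* Write a narrow character; a wide character under it is erased entirely. */
static void put_narrow(Cell *row, int x, uint32_t cp)
{
    assert(x >= 0 && x < COLS);
    orphan_fix(row, x);
    row[x].cp = cp;
    row[x].kind = CELL_NARROW;
}

/* Write a wide character at x. Returns 0 without touching the row when it
 * would straddle the right margin, so the caller can wrap to the next line
 * and leave the last cell alone; returns 1 on success. */
static int put_wide(Cell *row, int x, uint32_t cp)
{
    assert(x >= 0 && x < COLS);
    if (x + 1 >= COLS)
        return 0;
    orphan_fix(row, x);
    orphan_fix(row, x + 1);
    row[x].cp = cp;
    row[x].kind = CELL_WIDE_LEFT;
    row[x + 1].cp = 0;
    row[x + 1].kind = CELL_WIDE_RIGHT;
    return 1;
}

/* Delete one cell (as ESC [ P would): the surviving half of a wide
 * character becomes a blank, and everything to the right shifts left by
 * exactly one cell regardless of which half was deleted. */
static void delete_cell(Cell *row, int x)
{
    assert(x >= 0 && x < COLS);
    orphan_fix(row, x);
    memmove(&row[x], &row[x + 1], (size_t)(COLS - 1 - x) * sizeof(Cell));
    blank(&row[COLS - 1]);
}
```

In this model, deleting either half of a wide character at columns 0-1 leaves a single blank in column 0 and moves whatever was in column 2 to column 1, which matches the "same amount" property noted for xterm and kterm.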

I haven't made any detailed investigation into the actual set of
characters that are considered wide (or combining) by each terminal and
by various applications (except tmux, which has a list of ranges in
utf8.c). I also haven't investigated whether any of them have
locale-dependent treatment of "ambiguous" characters (e.g. greek or
cyrillic) which are wide in historical east asian fonts (except tmux,
which does not).
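For illustration, a range-table lookup with a switch for "ambiguous" characters might look something like the sketch below. This is not tmux's utf8.c or any terminal's actual table: the struct, the W_AMBIG marker, the ambiguous_is_wide parameter and the handful of ranges are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { W_AMBIG = -1 };               /* hypothetical marker for ambiguous width */

struct range {
    uint32_t first, last;
    int width;                       /* 0, 2, or W_AMBIG */
};

/* Sorted by first, non-overlapping. An illustrative subset only. */
static const struct range ranges[] = {
    { 0x0300, 0x036F, 0 },           /* Combining Diacritical Marks */
    { 0x0391, 0x03A1, W_AMBIG },     /* Greek capitals Alpha..Rho */
    { 0x03B1, 0x03C1, W_AMBIG },     /* Greek small alpha..rho */
    { 0x0410, 0x044F, W_AMBIG },     /* Cyrillic A..ya */
    { 0x1100, 0x115F, 2 },           /* Hangul Jamo initial consonants */
    { 0x4E00, 0x9FFF, 2 },           /* CJK Unified Ideographs */
    { 0xAC00, 0xD7A3, 2 },           /* Hangul Syllables */
    { 0xFF01, 0xFF60, 2 },           /* Fullwidth Forms */
};

/* Binary search over the table. Code points not listed are treated as
 * width 1; ambiguous ones are wide only if the caller asks for that
 * (e.g. because of an east asian locale or a user option). */
static int cell_width(uint32_t cp, int ambiguous_is_wide)
{
    size_t lo = 0, hi = sizeof(ranges) / sizeof(ranges[0]);

    assert(cp <= 0x10FFFF);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].first)
            hi = mid;
        else if (cp > ranges[mid].last)
            lo = mid + 1;
        else if (ranges[mid].width == W_AMBIG)
            return ambiguous_is_wide ? 2 : 1;
        else
            return ranges[mid].width;
    }
    return 1;
}
```

A terminal without locale-dependent treatment would simply always pass 0 for ambiguous_is_wide.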

mlterm does have an option that makes it work differently; the above
results are with -Z enabled.

        * If st has double-width default.
                * What happens if the application does naive character
                  counting? Will layouts break?

My experience is that layouts break now. I'm not sure if I can think of
an application that would break that wouldn't break already due to UTF-8
support (counting bytes).
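As a small illustration of the byte-counting problem, the sketch below compares strlen with a code point count. The function name utf8_codepoints is hypothetical and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
 * Assumption: s is well-formed UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the two ideographs U+65E5 U+672C, strlen reports 6 and utf8_codepoints reports 2, while a terminal that treats them as wide uses 4 cells. An application that pads columns using either of the first two numbers gets the layout wrong.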

                * Is there some way to tell the application that we have
                  double-width support enforced except for the terminfo?

I would argue that an application that doesn't expect wide character
support shouldn't be outputting CJK characters.

                * How do applications implement this? Is there some historical
                  cruft that will break?

I can't speak for every application ever, but I did observe that zsh
breaks when confronted with characters that should be wide, in the
prompt, being treated as narrow. I haven't ever heard of anything
breaking in ways related to this (as opposed to e.g. by byte counting on
UTF-8) with the behavior currently implemented in other terminal emulators.

        * With an option to toggle the double-width handling:
                * Is this needed for tmux, screen or other terminal proxies
                  that for example miss BCE too?


tmux does support it. I don't know about screen.

mlterm has such an option. With mlterm's implementation, characters are
still visually wide, which leads to some visual glitches and surprising
cursor movement behavior.

These are the questions I am missing an answer to before implementing this.
The code isn’t a problem.


Sincerely,

Christoph Lohmann
