Re: wcwidth

Markus Kuhn Mon, 25 Sep 2000 09:41:47 -0700
Bruno Haible wrote on 2000-09-25 15:14 UTC:
> I think your wcwidth implementation
> (http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c) and mine (in libutf8)
> should be changed as follows:
> 
>   * Make the LINE SEPARATOR and PARAGRAPH SEPARATOR (categories Zl and
>     Zp) non-printable, i.e. wcwidth returns -1 for them. These two
>     separators are modern forms of U+000A and U+000C and should be
>     handled like them.
> 
>   * Make the characters of category Cf have width 0. The Unicode 3.0
>     book, in the section about U+200E, U+200F, U+202A..U+202E, talks
>     about "the other zero-width characters", implying that they are
>     zero-width although they are not listed as Non-Spacing in PropList.txt.
> 
> What do you think?

My wcwidth() implementation was intended to give an accurate prediction
of how many character cells an xterm uses with a charcell font. In fact,
it is now what xterm uses, and I do hope that it will become a bit
of a de facto standard for other terminal emulators as well.

LS, PS, and zero-width space characters (along with Hangul Johab, Indic
characters, etc.) are all treated like unassigned characters in xterm
and you will get a default character printed for them, occupying one
character cell. Therefore the correct output is wcwidth() == 1 here.
Anything else will just create confusion and would make wcwidth() useless
for predicting the cursor position on a biwidth charcell output device,
which is its one and only application.
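
To make the rule above concrete, here is a minimal sketch in C. It is
not the code from wcwidth.c; the names sketch_cell_width() and
sketch_columns() are hypothetical, the wide ranges are only an
illustrative subset, and combining characters are left out entirely.
It only shows the policy argued for here: anything the terminal draws
as a default glyph, including LS, PS and the zero-width characters,
counts as one cell.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from wcwidth.c): number of character cells
 * one code point occupies under the policy described in this message. */
int sketch_cell_width(unsigned long ucs)
{
    if (ucs == 0)
        return 0;                    /* NUL occupies no cell */
    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0))
        return -1;                   /* C0/C1 controls and DEL: non-printable */
    if ((ucs >= 0x4e00 && ucs <= 0x9fff) ||   /* CJK Unified Ideographs */
        (ucs >= 0xac00 && ucs <= 0xd7a3))     /* Hangul Syllables */
        return 2;                    /* illustrative subset of the wide ranges */
    /* Everything else, including U+2028 (LS), U+2029 (PS), U+200B and
     * the other Cf characters, is assumed to be drawn as one default
     * glyph and therefore counts as one cell. */
    return 1;
}

/* Hypothetical helper: predicted cursor advance for n code points,
 * or -1 if the sequence contains a non-printable character. */
int sketch_columns(const unsigned long *s, size_t n)
{
    int total = 0;

    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int w = sketch_cell_width(s[i]);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

Under these assumptions, the sequence 'a', U+2028, U+4E2D advances the
cursor by 1 + 1 + 2 = 4 cells, which is what an application needs to
know to keep track of the cursor position.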

The LS/PS characters were not intended for applications such as talking
to a terminal emulator, so please don't send them to a terminal
emulator. The zero-width spaces and joiners are only required for
ligature output. For the foreseeable future, this is probably outside
the scope of VT100-like terminal emulators, and therefore also outside
the scope of wcwidth(). All these characters will just produce the
default character on the terminal screen, hence wcwidth() == 1 here as
well. The same goes for the remaining Cf characters.

Don't read too much into what the Unicode book says about esoteric
non-ASCII control characters, Johab, etc. They are all just empty
boxes. For terminal emulators, ISO 10646-1 is usually less confusing
literature. We must make sure that we keep the UTF-8 terminal emulator
semantics simple and easy to understand, and I am now very happy with
what we have reached in the latest xterm with Robert's patches applied.
This looks like a robust and practically useful standard to me that I
hope will get widely adopted. For more, turn to Pango, not ISO 6429.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/