In addition to the issues I raised earlier about the consistency of
width under canonical equivalence, I've found further problems with
the width definitions. These are not technical issues like the earlier
ones, just feasibility-of-presentation issues. Specifically, several
Indic scripts, including Kannada and Malayalam, have a number of
characters whose standard presentation glyphs require 6 or 7 vertical
strokes, and many more that require 4 or 5. Moreover, the standard
glyph shapes for these characters are roughly twice as wide (sometimes
more than twice) as they are tall.

This puts their horizontal complexity on par with most ideographic
characters, and makes it impossible to render them legibly in a single
character cell without a huge font size. The possible courses of
action are:

1. Leave them with wcwidth of 1 anyway and assume everyone will use
   huge font sizes or else put up with completely illegible glyphs.

2. Assign a global wcwidth of 2 to the affected scripts.

3. Perform "a careful analysis not only of each Unicode character,
   but also of each presentation form", as Markus suggested in his
   wcwidth.c comments, and assign a width of 1 or 2 (or perhaps 3?)
   on a per-character basis.

IMO course 1 is ridiculous. The only argument for it is compatibility,
but obviously no one has ever tried using wcwidth with these scripts
since it just plain doesn't work.

Course 3 is difficult but might give the most visually pleasing
results. On the other hand, it may tend to lock one into a particular
style of presentation forms. If preferred glyph forms change due to
"reforms" or just stylistic preferences, users could be left with a
mess. Part of the analysis for #3 would have to include making sure
that the width assignments remain reasonable under such variations
rather than being font-specific, but this is probably feasible as
long as the number of "width>1" characters is kept to a minimum.
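
To make the shape of course 3 concrete, here is a rough sketch of a
per-character override lookup, in the same spirit as the interval
tables in wcwidth.c. Everything in it is an assumption for
illustration: the names width_override and course3_width are made up,
and the table is supplied by the caller because the actual
per-character analysis has not been done. base_width stands for
whatever the existing wcwidth data reports.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical override record: code points first..last (inclusive)
 * get the given width. */
struct width_override {
    uint32_t first;
    uint32_t last;
    int width;
};

/* Binary search over a table sorted by 'first' with non-overlapping
 * ranges. Nonprintable (-1) and combining (0) results from the base
 * data are passed through untouched. */
int course3_width(uint32_t ucs, int base_width,
                  const struct width_override *table, size_t n)
{
    size_t lo = 0, hi = n;

    if (base_width <= 0)
        return base_width;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ucs > table[mid].last)
            lo = mid + 1;
        else if (ucs < table[mid].first)
            hi = mid;
        else
            return table[mid].width;
    }
    return base_width;  /* no override: keep the existing width */
}
```

The smaller the table, the easier it is to check that each entry
stays reasonable across fonts and glyph styles.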

Finally there's course 2. In a way it's sort of a cop-out, taking the
easy approach of "fixed width", but that's what character cell widths
have done ever since "i" and "m" received the same width of 1 column.
It's font-independent and ensures that text in a single script can
align well in columns regardless of which characters are used.
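
For comparison, course 2 could be as small as the sketch below. The
function names are hypothetical, and I am assuming that "affected
scripts" can be approximated by whole Unicode blocks (Kannada at
U+0C80..U+0CFF, Malayalam at U+0D00..U+0D7F); the real list of scripts
would still have to be agreed on. base_width is again whatever the
existing wcwidth data reports, so combining marks keep their width
of 0.

```c
#include <stdint.h>

/* Assumption: the affected scripts are identified by Unicode block.
 * Only the two scripts named above are listed; others would be added
 * here once they have been reviewed. */
static int in_wide_script(uint32_t ucs)
{
    return (ucs >= 0x0C80 && ucs <= 0x0CFF)     /* Kannada block */
        || (ucs >= 0x0D00 && ucs <= 0x0D7F);    /* Malayalam block */
}

/* Course 2: every spacing character of an affected script gets width
 * 2. Values <= 0 (-1 for nonprintable, 0 for combining) are passed
 * through unchanged. */
int course2_width(uint32_t ucs, int base_width)
{
    if (base_width <= 0)
        return base_width;
    return in_wide_script(ucs) ? 2 : base_width;
}
```

With this, course2_width(0x0C95, 1) gives 2 for KANNADA LETTER KA,
while a nonspacing sign such as U+0CCD stays at 0 if the base data
already reports it as 0.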

I can prepare example bitmaps if anyone is interested in seeing what
the choices might look like, and probably will do this soon anyway.
Again, my goal is revising the wcwidth data (which Markus labelled as
incomplete in the original version) to account for scripts for which
it is not currently being used and for which it does not currently
provide reasonable results. But it's useless for me to just say what I
think it should be. There should be some sort of sane process here, by
which we arrive at a de facto standard which glibc and other
implementations can adopt.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
