In addition to the issues I raised before about consistency of width under canonical equivalence, I've found further problems in the width definitions. These are not technical issues like the earlier ones, just feasibility-of-presentation issues. Specifically, several Indic scripts, including Kannada and Malayalam, have several characters which require 6 or 7 vertical strokes in their standard presentation glyphs, and numerous characters that require 4 or 5. Moreover, the standard glyph shapes for these characters are roughly twice as wide (sometimes more than twice) as they are tall.
This puts their horizontal complexity on par with most ideographic characters, and makes it impossible to render them legibly in a single character cell without a huge font size. The possible courses of action are:

1. Leave them with a wcwidth of 1 anyway and assume everyone will use huge font sizes or else put up with completely illegible glyphs.

2. Assign a global wcwidth of 2 to the affected scripts.

3. Perform "a careful analysis not only of each Unicode character, but also of each presentation form", as Markus suggested in his wcwidth.c comments, assigning a width of 1 or 2 (or possibly even 3??) on a per-character basis.

IMO course 1 is ridiculous. The only argument for it is compatibility, but obviously no one has ever tried using wcwidth with these scripts, since it just plain doesn't work.

Course 3 is difficult but might give the most visually pleasing results. On the other hand, it may tend to lock one into a particular style of presentation forms. If preferred glyph forms change due to "reforms" or just stylistic preferences, users could be left with a mess. Part of the analysis for course 3 would have to include making sure that the width assignments remain reasonable under such variations, as opposed to being font-specific. This is probably not infeasible as long as the number of "width>1" characters is kept to a minimum.

Finally there's course 2. In a way it's sort of a cop-out, taking the easy approach of "fixed width", but that's what character cell widths have done ever since "i" and "m" received the same width of 1 column. It's font-independent and ensures that text in a single script aligns well in columns regardless of which characters are used.

I can prepare example bitmaps if anyone is interested in seeing what the choices might look like, and probably will do this soon anyway.
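To make course 2 concrete, here is a rough sketch of what a script-wide override could look like when layered on top of an existing width table. Everything in it is an assumption for illustration: the names in_wide_script and course2_width are made up and are not part of glibc or of Markus's wcwidth.c, and the two Unicode block ranges merely stand in for "the affected scripts", whose actual membership is exactly what would need to be agreed on. The caller passes in the width the existing table reports, so combining marks (width 0) and nonprintable characters (width -1) are left alone.

```c
#include <stdint.h>

/* Hypothetical helper: does this code point fall in a script that
 * course 2 would treat as double-width?  The ranges are the Unicode
 * blocks for Kannada (U+0C80..U+0CFF) and Malayalam (U+0D00..U+0D7F);
 * the real list of scripts is an open question. */
static int in_wide_script(uint32_t ucs)
{
    return (ucs >= 0x0C80 && ucs <= 0x0CFF) ||
           (ucs >= 0x0D00 && ucs <= 0x0D7F);
}

/* Hypothetical course-2 rule.  base_width is whatever the existing
 * table gives for ucs: -1 for nonprintable, 0 for combining marks,
 * 1 or 2 for spacing characters.  Only width-1 spacing characters in
 * an affected script are widened; everything else passes through. */
static int course2_width(uint32_t ucs, int base_width)
{
    if (base_width == 1 && in_wide_script(ucs))
        return 2;
    return base_width;
}
```

With this rule, KANNADA LETTER KA (U+0C95) and MALAYALAM LETTER KA (U+0D15) would each take 2 columns, KANNADA SIGN VIRAMA (U+0CCD) would stay at 0 if the underlying table treats it as combining, and Latin text would be unaffected. Course 3 would replace the range test with a per-character table.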
Again, my goal is revising the wcwidth data (which Markus labelled as incomplete in the original version) to account for scripts for which it is not currently being used and for which it does not currently provide reasonable results. But it's useless for me to just say what I think it should be. There should be some sort of sane process here, by which we arrive at a de facto standard which glibc and other implementations can adopt.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/