Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Rich Felker Sun, 05 Nov 2006 23:57:27 -0800

On Sun, Nov 05, 2006 at 12:59:03PM -0800, rajeev joseph sebastian wrote:
> Well, most correctly implemented Unicode-aware applications do this also:
> have 2 backing stores, one for text and the other for glyphs. Use
> the glyph representation for display. When a selection is done, the
> map between the 2 stores is used to derive the correct text for the
> selected glyphs.


Yes, this is roughly what uuterm does (except that it doesn't keep a
glyph representation; it generates one dynamically). However,
applications running on the terminal don't have any way to know about
glyphs; all they can access are the characters.
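
To make the "characters only" model concrete, here is a minimal sketch
in Python. It is not uuterm's code or data structure; the helper names
to_cells and selected_text are hypothetical. Each cell stores a base
character plus the combining marks that follow it, so a selection
over cells yields the original text directly and no separate glyph
store is needed. As a simplification it only attaches marks whose
canonical combining class is nonzero.

```python
import unicodedata

def to_cells(text):
    """Group text into cells: one base character plus following combining marks.

    Assumption for this sketch: a character joins the previous cell only
    if its canonical combining class is nonzero.
    """
    cells = []
    for ch in text:
        if cells and unicodedata.combining(ch) != 0:
            cells[-1] += ch      # attach the mark to the previous base character
        else:
            cells.append(ch)     # start a new cell
    return cells

def selected_text(cells, start, end):
    """Return the characters stored in cells[start:end]."""
    return "".join(cells[start:end])

cells = to_cells("e\u0301a")          # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(cells))                     # 2
print(selected_text(cells, 0, 1) == "e\u0301")   # True
```

Glyphs would be produced from each cell only at draw time; the
selection code never sees them.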

> Currently, most apps I have seen use the precomposed Latin
> characters, which is allowed only because of the stability policy.
> Most apps do not implement complex layout of Latin glyphs which
> causes no end of problems for Latin transliterations of Indic/other
> text. Although most of the required characters for Indic
> transliteration are already available precomposed, the policy of
> Unicode and the combining mark model do not allow the rest to be
> encoded. Hence the proliferation of PUA codepoints for this purpose.
> (I hope the situation changes for GNU/Linux, but I think it is
> unlikely).

uuterm already has full support for combining marks, including varied
placement of the diacritics. It doesn't use precomposed glyphs even if
they're available; it always decomposes to NFD (with some additional
decompositions necessary because of stupid Unicode policies) for
rendering.
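
As an illustration of the decomposition step only (not uuterm's
implementation, and without the additional non-standard decompositions
mentioned above), Python's standard unicodedata module shows what NFD
does to a precomposed Latin character. The function name
decompose_for_rendering is made up for this sketch.

```python
import unicodedata

def decompose_for_rendering(text):
    """Return the NFD form of text as a list of U+XXXX code point labels."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) decomposes into the base
# letter followed by a combining acute accent.
print(decompose_for_rendering("\u00e9"))   # ['U+0065', 'U+0301']
```

A renderer working this way only needs a glyph for the base letter and
one for each combining mark, rather than one per precomposed character.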

> ----------
> The other issue here is that there's no standard for glyph numbering,
> and Unicode doesn't represent glyphs, so there's really no way an
> application running on a terminal could directly print glyphs. Even if
> it could, just "cat file_with_indic_text.txt" on the terminal, or
> something simple like "ls", would probably not work as expected.
> ------------
> There is no need for glyph numbers and that is one of the strong points
> of Unicode.

I agree totally. However it does mean that applications running on a
terminal don't have any way to operate in terms of glyphs. Everything
they do must be in terms of characters. This is why we're only able to
consider character width and not glyph width for the purposes of
spacing.

> I would strongly suggest to look over the HarfBuzz
> library which is slowly evolving which will allow you to use the
> work of the best minds in the community. It will transform
> codepoints into glyphs, which you can then use. (You can also use
> Pango if need be).

uuterm is based entirely on bitmap fonts, so these are not appropriate
solutions for it and probably not for kernel-level console drivers
either. However, any character-width tables agreed upon should be able
to be used reasonably with OpenType fonts too of course. It would be
silly to try to adopt a standard that excludes a popular modern
technology. However, just like with Latin, fonts whose metrics don't
fit well with the cell widths wouldn't look very good in a terminal
emulator.

IMO, in a way this is part of an argument for the "excessive" spacing
too -- if there's extra space you can fit almost any font in there...
and optionally scale it to try to fill up the space if desired, or
distribute the extra spacing equally spread-out, etc.

> My (naive)
> understanding is that Kannada conjuncts take place mostly as a
> "subscript" to the bottom-right of the initial consonant and vowel
> mark, so perhaps they'll look fairly proper in such a scheme.
> 
> -------
> This is not always true. For Kannada, I will try to confirm that.

I have a friend I can check with too, but going from the sparse
information in the Unicode specs and sites like Omniglot and
Wikipedia, it seems to be true that even 'subjunct' conjunct
characters use some of their own horizontal space. Sometimes
characters that would definitely need 2 cells on their own are simple
enough to fit in one cell when they are a subjunct character though,
so spacing is not entirely ideal, but the glyphs I experimented with
drawing seemed to fit legibly anyway. I can send you the XBM files if
you're interested in seeing them. (They're not hideously ugly like the
ASCII art below. :)

> If you mean to say that each logical cluster will be allocated
> enough width equal to the sum of the widths of each character in
> that cluster, then I think you will allocate much too much space :)

Yes, I know. :) But given the choice between too much and not enough,
too much is better.
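
The "sum of the widths" rule can be sketched in a few lines of Python.
The width rule here is an assumption made for illustration, not an
agreed wcwidth table: nonspacing and enclosing marks get 0 cells, East
Asian wide and fullwidth characters get 2, everything else gets 1. The
sample cluster uses Kannada code points (PA, VIRAMA, PA) purely as an
example of a consonant conjunct.

```python
import unicodedata

def char_width(ch):
    """Hypothetical per-character cell width (not a real wcwidth table)."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0                                  # nonspacing/enclosing marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                                  # wide and fullwidth characters
    return 1

def cluster_width(cluster):
    """Cells allocated to a cluster: the sum of its characters' widths."""
    return sum(char_width(ch) for ch in cluster)

ppa = "\u0CAA\u0CCD\u0CAA"    # KANNADA LETTER PA, SIGN VIRAMA, LETTER PA
print(cluster_width(ppa))     # 2
```

The application can compute this from the characters alone, with no
knowledge of the font, which is the whole point; the cost is that a
conjunct may be given more cells than its glyph strictly needs.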

Can I ask you if something like the following (aside from the bad
ascii art :) is horribly offensive:

pa:
       #
       #
  ##   #
 #  #  #
 #  #  #
   #####

ppa:
              #
   ##         #
  #  #        #
  #  #        #
    ###########
              #
    ##        #
   #  #       #
     ##########
(became wide because it was allocated 2 spaces due to two "pa"
characters..)

Hopefully these pictures explain a bit of one way that excess space
could be filled up. Whether it looks reasonable or not, I don't know,
but I suspect it's better than leaving empty space.

> Yes.. it's not really a curses problem though. As long as the terminal
> supports reordering and ligatures, using curses should not be much of
> a problem. I still need to write the reordering stuff for uuterm
> though.
> 
> ----------
> I strongly suggest looking over the HarfBuzz library.

I've looked at it before, but much like uuterm it's hardly documented.
A bit of RTFS'ing suggested that it's also excessively complex in
terms of the data structures it uses. :(

In any case, while the HarfBuzz library can handle glyph selection for
a program using OpenType fonts, and likewise the spacing, there's
nothing it can do to solve the problem of spacing on a terminal. This
is because the metrics returned by font libraries are inherently
font-specific, whereas the spacing on a terminal must be
font-independent (since the application attached to the terminal has
no knowledge of the font being used).

I'm not sure if I'll be able to work out any kind of presentation
scheme you'll find acceptable. If not, I'm sorry, but I simply don't
have the time or resources to rewrite the display handling of every
single application which runs on a terminal to make them all aware of
complex spacing interactions, and even if I did, I don't think anyone
has any idea what the _right_ system for this would be.

What I have already been able to do is make a lot of languages which
were previously unusable on terminals usable through simple but
powerful context-sensitive shaping. This is a much easier problem to
solve than context-sensitive spacing. What I can (and hope to)
continue to do is find ways that additional languages/scripts can be
supported without any unreasonable degree of ugliness. It looks like
Kannada will fit pretty well into this system, and Hindi fits ok aside
from the excessive space left when "ra" becomes a nonspacing mark. If
other Indic and Indic-derived scripts work, great! If Burmese
(supposedly very difficult) manages to work that will make me very
happy. Regardless of whether it's ugly or not, though, I think it
would be nice (and beneficial to some users at least) to have
Malayalam supported at least minimally.

> Could you post a
> link to uuterm development website ?

These are the various relevant links:

http://svn.mplayerhq.hu/uuterm/trunk/
svn://svn.mplayerhq.hu/uuterm/trunk/
http://brightrain.aerifal.cx/~dalias/uuterm/screenshots/
http://brightrain.aerifal.cx/~dalias/ucf/fonts/

Sorry the documentation is so sparse. I'm presently working on getting
nice character coverage in the default distribution so that I can
promote uuterm without potential users saying "wtf how am I supposed
to use this when there are no fonts?!"

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
