On Sun, Nov 05, 2006 at 12:59:03PM -0800, rajeev joseph sebastian wrote:
> Well, most correctly implemented Unicode-aware applications do this also:
> have 2 backing stores, one for text and the other for glyphs. Use
> the glyph representation for display. When a selection is done, the
> map between the 2 stores is used to derive the correct text for the
> selected glyphs.

Yes, this is roughly what uuterm does (except it doesn't keep a glyph
representation; it generates it dynamically). However, applications
running on the terminal don't have any way to know about glyphs; all
they can access are the characters.

> Currently, most apps I have seen use the precomposed Latin
> characters, which is allowed only because of the stability policy.
> Most apps do not implement complex layout of latin glyphs which
> causes no-end of problems for Latin transliterations of Indic/other
> text. Although most of the required characters for Indic
> transliteration are already available precomposed, the policy of
> Unicode and the combining mark model do not allow the rest to be
> encoded. Hence the proliferation of PUA codepoints for this purpose.
> (I hope the situation changes for GNU/Linux, but I think it is
> unlikely).

uuterm already has full support for combining marks, including varied
placement of the diacritics. It doesn't use precomposed glyphs even if
they're available; it always decomposes to NFD (with some additional
decompositions necessary because of stupid Unicode policies) for
rendering.

> ----------
> The other issue here is that there's no standard for glyph numbering,
> and Unicode doesn't represent glyphs, so there's really no way an
> application running on a terminal could directly print glyphs. Even if
> it could, just "cat file_with_indic_text.txt" on the terminal, or
> something simple like "ls", would probably not work as expected.
> ------------
> There is no need for glyph numbers and that is one of the strong points
> of Unicode.

I agree totally. However, it does mean that applications running on a
terminal don't have any way to operate in terms of glyphs. Everything
they do must be in terms of characters. This is why we're only able to
consider character width and not glyph width for the purposes of
spacing.

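To make the character-based accounting concrete, here is a rough sketch
in Python. It is not code from uuterm, and the width rule in it
(char_width) is a simplified, hypothetical stand-in for whatever
character-width table ends up being agreed upon: nonspacing marks get 0
cells, East Asian wide characters get 2, and everything else gets 1.
The only point is that the width of a cluster is computed from the
characters alone, with no reference to any font.

```python
import unicodedata

def char_width(ch: str) -> int:
    """Cell width of one character under a simplified, hypothetical rule."""
    # Nonspacing/enclosing marks and format controls take no cell of their own.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide/Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is assumed to take one cell (an assumption of this sketch).
    return 1

def cluster_width(text: str) -> int:
    """Width of a string as the plain sum of its characters' widths."""
    return sum(char_width(ch) for ch in text)

PA = "\u0caa"      # KANNADA LETTER PA
VIRAMA = "\u0ccd"  # KANNADA SIGN VIRAMA (a nonspacing mark)

print(cluster_width(PA))                # a single "pa"
print(cluster_width(PA + VIRAMA + PA))  # "ppa" encoded as pa + virama + pa
# Decomposing to NFD does not change the width: e + combining acute.
print(cluster_width(unicodedata.normalize("NFD", "\u00e9")))
```

Under this rule pa + virama + pa is given two cells, one per "pa", no
matter how a particular font would draw the conjunct.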
> I would strongly suggest to look over the HarfBuzz
> library which is slowly evolving which will allow you to use the
> work of the best minds in the community. It will transform
> codepoints into glyphs, which you can then use. (You can also use
> Pango if need be).

uuterm is based entirely on bitmap fonts, so these are not appropriate
solutions for it, and probably not for kernel-level console drivers
either. However, any character-width tables agreed upon should of
course be usable with OpenType fonts too. It would be silly to adopt a
standard that excludes a popular modern technology. However, just like
with Latin, fonts whose metrics don't fit well with the cell widths
wouldn't look very good in a terminal emulator. IMO, in a way this is
part of an argument for the "excessive" spacing too -- if there's extra
space you can fit almost any font in there... and optionally scale it
to fill up the space if desired, or distribute the extra spacing
evenly, etc.

> My (naive)
> understanding is that Kannada conjuncts take place mostly as a
> "subscript" to the bottom-right of the initial consonant and vowel
> mark, so perhaps they'll look fairly proper in such a scheme.
>
> -------
> This is not always true. For Kannada, I will try to confirm that.

I have a friend I can check with too, but going from the sparse
information in the Unicode specs and sites like Omniglot and Wikipedia,
it seems to be true that even 'subjunct' conjunct characters use some
of their own horizontal space. Sometimes characters that would
definitely need 2 cells on their own are simple enough to fit in one
cell when they are a subjunct character, though, so spacing is not
entirely ideal, but the glyphs I experimented with drawing seemed to
fit legibly anyway. I can send you the xbm files if you're interested
in seeing them. (They're not hideously ugly like the ascii art below..
:)

> If you mean to say that each logical cluster will be allocated
> enough width equal to the sum of the widths of each character in
> that cluster, then I think you will allocate much too much space :)

Yes, I know. :) But given the choice between too much and not enough,
too much is better. Can I ask you if something like the following
(aside from the bad ascii art :) is horribly offensive:

pa:
# # ## # # # # # # # #####

ppa:
# ## # # # # # # # ########### # ## # # # # ##########

(became wide because it was allocated 2 spaces due to two "pa"
characters..)

Hopefully these pictures explain a bit of one way that excess space
could be filled up. Whether it looks reasonable or not, I don't know,
but I suspect it's better than leaving empty space.

> Yes.. it's not really a curses problem though. As long as the terminal
> supports reordering and ligatures, using curses should not be much of
> a problem. I still need to write the reordering stuff for uuterm
> though.
>
> ----------
> I strongly suggest to look over HarfBuzz library.

I've looked at it before, but much like uuterm it's hardly documented.
A bit of RTFS'ing suggested that it's also excessively complex in terms
of the data structures it uses. :(

In any case, while the HarfBuzz library can handle glyph selection for
a program using OpenType fonts, and likewise the spacing, there's
nothing it can do to solve the problem of spacing on a terminal. This
is because the metrics returned by font libraries are inherently
font-specific, whereas the spacing on a terminal must be
font-independent (since the application attached to the terminal has no
knowledge of the font being used).

I'm not sure if I'll be able to work out any kind of presentation
scheme you'll find acceptable.
If not, I'm sorry, but I simply don't have the time or resources to
rewrite the display handling of every single application which runs on
a terminal to make them all aware of complex spacing interactions, and
even if I did, I don't think anyone has any idea what the _right_
system for this would be.

What I have already been able to do is make a lot of languages which
were previously unusable on terminals usable through simple but
powerful context-sensitive shaping. This is a much easier problem to
solve than context-sensitive spacing. What I can (and hope to) continue
to do is find ways that additional languages/scripts can be supported
without any unreasonable degree of ugliness. It looks like Kannada will
fit pretty well into this system, and Hindi fits ok aside from the
excessive space left when "ra" becomes a nonspacing mark. If other
Indic and Indic-derived scripts work, great! If Burmese (supposedly
very difficult) manages to work, that will make me very happy.

Regardless of whether it's ugly or not, though, I think it would be
nice (and beneficial to some users at least) to have Malayalam
supported at least minimally.

> Could you post a
> link to uuterm development website ?

These are the various relevant links:

http://svn.mplayerhq.hu/uuterm/trunk/
svn://svn.mplayerhq.hu/uuterm/trunk/
http://brightrain.aerifal.cx/~dalias/uuterm/screenshots/
http://brightrain.aerifal.cx/~dalias/ucf/fonts/

Sorry the documentation is so sparse. I'm presently working on getting
nice character coverage in the default distribution so that I can
promote uuterm without potential users saying "wtf how am I supposed to
use this when there are no fonts?!"

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/