On Thu, 28 Nov 2002, Jungshik Shin wrote:
> On Thu, 28 Nov 2002, Owen Taylor wrote:
>
> > The path to adding full beyond-the-BMP support to Pango is
> > pretty straightforward. (I'm a little surprised that it doesn't
> > sort of work now for TrueType fonts, but I haven't tested
> > it at all.)
>
> So, what I wrote about 'UTF-32 cleanness' was not the case. There are
> some libraries that support BMP only for the moment. As for Pango,

It turned out that Pango/Glib are not the only libraries that need to be modified a bit to be UTF-32-clean. Xft and Freetype2 also have a small problem with UTF-32 support, although all the data structures used in both have been UTF-32-clean from the beginning.

Because MS IE 5.5 or later can render non-BMP characters well with fonts like Code2001, I decided to put Mozilla on par with it. It's relatively easy (for Mozilla-Xft), and I went so far as to get Mozilla to draw nice 6-hex-digit unknown-character glyphs for unknown non-BMP characters. (For unknown characters in the BMP, it still draws the 4-hex-digit unknown-character glyph.) It's good to see the 6-digit unknown-character glyphs work, but it's disappointing to see them show up even for characters covered by a font (CODE2001.TTF) that I have on my fontconfig search path and that I explicitly specified via CSS.

Why? It's simple. FcCharSetHasChar is returning false for non-BMP characters (e.g. U+10331) although Code2001 has them. I wrote a couple of test programs, one to test fontconfig and the other to test freetype2 (2.1.3, the latest stable release, released a couple of weeks ago) [1]. I found the cause and I'm going to enclose my patch.

Why does FcCharSetHasChar fail for non-BMP characters? It's because fontconfig calls FT_Select_Charmap() instead of FT_Set_Charmap(). fontconfig doesn't use the latter apparently because Keith didn't want to deal with encodings other than Unicode, AppleRoman and AdobeSymbol. (To stay portable and still handle legacy multibyte encodings, fontconfig would need a built-in conversion routine, which would bloat it.) When FT_Select_Charmap() is called with 'FT_ENCODING_UNICODE' (or the deprecated ft_encoding_unicode), freetype activates the first cmap with Unicode encoding for subsequent operations on a font until another cmap is activated. That's not a problem for fonts covering the BMP only.

However, fonts like Code2001 have multiple cmaps, all with the identical symbolic FT encoding 'FT_ENCODING_UNICODE' but with different character coverage. Code2001 has 4 cmaps: pid=0,eid=0 (Unicode), pid=1,eid=1 (AppleRoman), pid=3 (MS),eid=1 (Unicode) and pid=3 (MS),eid=10 (Unicode). Only the last cmap has non-BMP characters, although the first and the third are also Unicode cmaps. They're actually UCS-2 cmaps. As mentioned above, Freetype2 makes the first cmap matching the 'symbolic encoding name' active, and unfortunately that happens to be one not covering non-BMP characters.

One may say that the font (CODE2001) is to blame and that its pid=0/eid=0 and pid=3/eid=1 cmaps should cover non-BMP characters as well. However, it's not very clear that they have to, according to the MS document at <http://www.microsoft.com/typography/otspec/cmap.htm>. Even if they have to, I think Freetype2 has to be defensive and provide a workaround, because there may be some fonts lying around with similar problems.

One possible solution is not to return the first cmap table matching the symbolic encoding name 'FT_ENCODING_UNICODE' but to keep on looking to see if a pid=3/eid=10 cmap is also present. If it is, it has to be activated instead of the first Unicode cmap found. An alternative is to introduce a new symbolic encoding name, 'FT_ENCODING_UCS4' (or UTF32), to distinguish the pid=3/eid=10 cmap from the other Unicode cmaps (pid=0/eid=0, pid=1/eid=?, pid=3/eid=1), which appear to be UCS-2 only in most cases. In that case, consumers of the FT2 library (e.g. fontconfig) have to be modified as well: when non-BMP characters are to be dealt with, the FT_ENCODING_UCS4 cmap has to be requested instead of FT_ENCODING_UNICODE. I thought the first is better (with a little performance penalty arising from having to keep on looking after hitting the first Unicode cmap) and it worked well with Mozilla (see <http://bugzilla.mozilla.org/show_bug.cgi?id=182877>).

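To make the first approach concrete, here is a rough sketch of the "keep on looking" selection logic. It is not the actual patch and it does not use FreeType's own types: CmapInfo, is_unicode_cmap() and pick_unicode_cmap() are hypothetical names made up for illustration. It also assumes, as a simplification, that pid=0 (any eid) and pid=3/eid=1 are UCS-2 Unicode cmaps and that only pid=3/eid=10 is a UCS-4 one.

```c
#include <stddef.h>

/* Hypothetical stand-in for one cmap subtable header; NOT a FreeType type. */
typedef struct {
    unsigned short platform_id;  /* "pid" in the text above */
    unsigned short encoding_id;  /* "eid" in the text above */
} CmapInfo;

/* Assumption: pid=0 (any eid), pid=3/eid=1 and pid=3/eid=10 count as Unicode. */
static int is_unicode_cmap(const CmapInfo *c)
{
    return c->platform_id == 0 ||
           (c->platform_id == 3 &&
            (c->encoding_id == 1 || c->encoding_id == 10));
}

/*
 * Return the index of the Unicode cmap to activate, or -1 if there is none.
 * Instead of stopping at the first Unicode cmap, keep scanning and prefer
 * pid=3/eid=10 (UCS-4) whenever it is present.
 */
int pick_unicode_cmap(const CmapInfo *cmaps, size_t count)
{
    int first_unicode = -1;

    for (size_t i = 0; i < count; i++) {
        if (!is_unicode_cmap(&cmaps[i]))
            continue;
        if (cmaps[i].platform_id == 3 && cmaps[i].encoding_id == 10)
            return (int)i;           /* UCS-4 cmap wins outright */
        if (first_unicode < 0)
            first_unicode = (int)i;  /* remember the fallback, keep looking */
    }
    return first_unicode;
}
```

With the four cmaps listed above for Code2001, this picks the last one (index 3) rather than the pid=0/eid=0 cmap at index 0. The extra cost is one full pass over the cmap list for fonts that have no pid=3/eid=10 cmap.
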
I also had to extend XftTextExtents16(), included in fcpackage-2.1, to deal with UTF-16 (instead of UCS-2). Xft2 has XftDrawStringUtf16() in addition to XftDrawString16() (the latter is for UCS-2). I thought about adding XftTextExtentsUtf16(), but it appears more convenient for programs like Mozilla, which use UTF-16 for their internal string representation, if XftTextExtents16() is extended to support UTF-16. Again, there's a little speed penalty.

Below are links to the FT2 patch (against 2.1.3) and the Xft patch (against fcpackage 2.1):

  http://bugzilla.mozilla.org/attachment.cgi?id=107852 : FT2 patch
  http://bugzilla.mozilla.org/attachment.cgi?id=107858 : Xft patch

There are a couple of screenshots along with the Mozilla patch and a couple of sample pages with non-BMP characters at

  http://bugzilla.mozilla.org/show_bug.cgi?id=182877

I believe Werner is on this list, so I won't write to him separately for now. Werner, if you find that my patch makes sense, it'd be nice to apply it to Freetype2.

BTW, it just occurred to me that the routine setting the default cmap for a newly opened FT_Face has to be modified in a similar manner. (Currently, it sets the first-found Unicode cmap as the default, but the first-matched Unicode cmap may not be the most extensive one, as I explained above.)

Jungshik

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/