Re: [Fonts]Automatic 'lang' determination
Keith Packard wrote:

>Around 14 o'clock on Jun 29, Yao Zhang wrote:
>
>>Sure, I will install as many Chinese fonts as possible and get the
>>fonts.cache for you. But before that, I will show you several lines in my
>>fonts.cache:
>
>I'm afraid the mailers corrupted the rather long lines in those files, but
>given that I've discovered that GB2312 is a relatively strong test for
>suitability for simplified Chinese, perhaps we can avoid sending this data
>at all.
>
>>Now for lang, ZYSong18030 is labelled as
>>lang=simplifiedchinese
>>while SimSun-18030 is labelled as
>>lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us
>
>These language tags come from the OS/2 table and are set by the font
>designer. If, as our friend Jungshik Shin says, simplified forms were
>not unified with traditional forms in the BMP, then it's quite reasonable
>to build a font that can cover both languages.

Although zysong and simsun are both from Beijing Zhongyi, the zysong shipped in Red Hat 7.3 is purely a GB18030 font file: it contains only the characters defined in the GB18030 standard. simsun provides extra characters to support other languages such as Japanese, and its OS/2 table says so.

Regards,
Shao

>With the new improved GB2312-based simplified test, I suspect the correct
>languages would be generated automatically from this font as well.
>
>I've gone ahead and committed the changes necessary for automatic lang
>determination to XFree86 CVS; those interested in verifying its
>sensitivity and specificity are welcome to check it out and run:
>
>    $ FC_DEBUG=256 fc-cache -f
>
>This will display the number of missing glyphs in each language for each
>font and also display errors in the lang value relative to that specified
>in the TrueType file.
>
>Keith Packard  XFree86 Core Team  HP Cambridge Research Lab

___
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts
[Fonts]Re: Han unification(SC and TC)(was..Re: Automatic 'lang' determination )
On Saturday, June 29, 2002, at 12:31 PM, Jungshik Shin wrote:

> I'm afraid what you have heard of BMP section is misleading if
> I understood you correctly. Whether in BMP or not, simplified forms of
> Chinese characters are NOT UNIFIED with traditional forms of Chinese
> characters. (let me copy my message to John H. Jenkins @Apple who knows a
> lot more about Han Unification than I do.)

This is correct. The interconversion between SC and TC is in general m-to-n, and so unification would not have been possible. Where a character is simply "written differently" in the PRC from Taiwan and elsewhere, they are unified (e.g., U+988A), and where an already extant character is used as a simplification for another, the older character and the simplified character are unified (e.g., U+53F0, which is both a TC character in its own right and the simplification for other characters, such as U+98B1). This is done, however, only because the SC form is seen as separate from its TC counterpart(s).

> AFAIK, most complaints about
> Han unification do NOT come from zh-CN vs zh-TW BUT from zh-CN/zh-TW
> vs ja. For Han characters common in both zh-CN and zh-TW, there's no
> significant difference in appearance between zh-CN and zh-TW.

Actually, there are some exceptions to this. U+988A and characters containing it make up the bulk of this. In general, however, you're quite correct.

> Although
> many Japanese would not agree with me, I don't think there's any
> significant difference across CJKV.

Also correct. It's on the order of "color" vs. "colour". In the bulk of the cases which have been unified, all the unified forms will be recognized by native readers of all the languages involved, even if they may look a little "funny."

> (again, ISO 10646 Han chart is a
> good reference along with ROC MOE's Han character variant dictionary at
> http://140.111.1.40) To me, Han Unification should have gone further (not
> less) in a sense and it's worrisome to me that non-BMP includes too many
> glyph variants (a whole bunch of them coming from Korean Buddhist text :
> see http://www.sutra.re.kr) that should have been unified in my eyes.

*sigh* This is also true. We should have pushed harder on the IRG during the Extension B work to keep this very thing from happening.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/
Re: [Fonts]Automatic 'lang' determination
Around 1 o'clock on Jun 30, Pablo Saratxaga wrote:

> What are those glyphs? (I'm quite surprised, I would have expected the
> opposite: fonts generally have more glyphs than the standard encodings of
> the iso-8859 family for example)

My definition of language tag is coloured by the OS/2 table codePageRange bits from which it was originally defined in fontconfig. Those bits are defined to map to specific Windows code pages; the Latin-1 case doesn't map to ISO 8859-1, but rather to code page 1252, for which many fonts are missing a few random entries. Similarly for the other tags, the existing fonts that I have don't generally seem to cover the complete Windows code page from which the codePageRange bit was derived.

> No, the tolerance for missing glyphs in CJK tests should be the same or
> even smaller. The difference is that it isn't needed to test all the glyphs
> for CJK coverages; testing only a set of 256 chosen glyphs would be enough
> (if they are correctly chosen, testing that 256 glyphs are present in a
> font is enough to assure, with 99.99% of confidence, that it covers a given
> CJK language).

I'm not confident enough of this approach; I fear that any set of 256 glyphs that must appear in a simplified Chinese font may well appear in many traditional Chinese (or even Japanese) fonts. Certainly we could experimentally determine a reasonable subset, and it's completely trivial to change the matching table used in the code.

> Of course, complete checking can also be done, but I wonder if it is
> actually useful (I mean, is there a font suitable for simplified chinese
> out there that doesn't encode all the characters of gb2312?

It seems that this must be the case -- I set the '500' number so high because all of the fonts which I have that advertise support for simplified Chinese are missing over 200 glyphs from GB2312. I got similar results for Japanese fonts, Korean Wansung fonts and traditional Chinese fonts.

I would need a significantly larger set of fonts than I currently have access to if I wanted to generate smaller test char sets. Now that the tests stand in isolation, perhaps those skilled with particular languages can develop more specific tests.

> But to handle such case, I think it would be better to choose a given
> definition of "big5" (or several of them) and stick to it, rather than
> allowing a so tremendously big hole as 500 possible missing chars.

Missing 500 from a repertoire of nearly 2 doesn't seem to render most of these fonts unusable.

Keith Packard  XFree86 Core Team  HP Cambridge Research Lab
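The remark that the "Latin-1" codePageRange bit means code page 1252 rather than ISO 8859-1 can be illustrated with a small sketch. It assumes Python and its standard `cp1252` and `latin-1` codecs as stand-ins for the two repertoires (fontconfig itself is C and uses its own tables); the helper name `repertoire` is hypothetical.

```python
def repertoire(codec: str) -> set[str]:
    """Return every character that a single-byte codec can decode."""
    chars = set()
    for byte in range(256):
        try:
            chars.add(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte is undefined in the code page
    return chars


cp1252 = repertoire("cp1252")
latin1 = repertoire("latin-1")

# Characters a font needs for code page 1252 that ISO 8859-1 never asks for.
extra = sorted(cp1252 - latin1)
for ch in extra:
    print(f"U+{ord(ch):04X} {ch}")
```

A font drawn to cover ISO 8859-1 only would be counted as missing each of the printed characters when tested against the code page 1252 table, which is one plausible source of the "few random entries" mentioned above.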
Re: [Fonts]Automatic 'lang' determination
Kaixo!

On Sat, Jun 29, 2002 at 01:32:36PM -0700, Keith Packard wrote:

> These language tags come from the OS/2 table and are set by the font
> designer. If, as our friend Jungshik Shin says, simplified forms were
> not unified with traditional forms in the BMP, then it's quite reasonable
> to build a font that can cover both languages.

Yes, in fact only "zh" would be enough as a language tag.

There are real differences in typographic traditions between Chinese and Japanese, so even when viewing the same character you can in some cases tell whether it has been extracted from a Chinese or a Japanese publication.

The differences between traditional and simplified aside, I don't think there are differences in typographic tradition between zh_CN and zh_TW; it is possible to design a typeface suitable for both. It is not possible to design a typeface suitable for both ja and zh.

The distinction between zh_CN and zh_TW as language tags is however useful, because a large number of fonts cover only one of the two sets.

--
Ki ça vos våye bén,
Pablo Saratxaga
http://chanae.stben.be/pablo/
PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
Re: [Fonts]Automatic 'lang' determination
Kaixo!

On Sat, Jun 29, 2002 at 01:20:34PM -0700, Keith Packard wrote:

> > A font is suited for a given language when it covers *ALL* of the codepoints
> > needed for that language.
>
> Yes, that's obviously true, but the problem is that I don't have tables for
> each language indicating the required codepoints, all I have are tables
> listing Unicode values in encodings traditionally used for each language.
> These tables almost always include a few (1-5) glyphs which many fonts are
> missing.

What are those glyphs? (I'm quite surprised, I would have expected the opposite: fonts generally have more glyphs than the standard encodings of the iso-8859 family, for example.)

>> So, the tests for CJK languages and for other languages are clearly different,
>> only CJK languages can go with testing only a "significant fraction",
>> for all other languages all chars must be tested.
>
> Yes, the tolerance value given for the Han languages is 500 codepoints
> while the value for non-Han languages is two orders of magnitude smaller.

No, the tolerance for missing glyphs in CJK tests should be the same or even smaller. The difference is that there is no need to test all the glyphs for CJK coverage; testing only a set of 256 chosen glyphs would be enough (if they are correctly chosen, testing that those 256 glyphs are present in a font is enough to assure, with 99.99% confidence, that it covers a given CJK language).

That cannot be done for the 8-bit Latin/Cyrillic encodings because there is too much overlap between them (in the case of iso-8859-1/iso-8859-15 the overlap is 97%, for example).

While there is also a lot of overlap between CJK encodings, there are large ranges of non-overlapping chars: chars that appear only in the Japanese encoding, or only in gb2312, or only in big5, etc. (By "only" I mean "not in any other widely used legacy encoding", so explicitly excluding Unicode, which of course includes them all.)

As those "exclusive" chars are numerous enough, it is possible to test for the presence of some of them in a font and determine a language coverage from there.

Of course, complete checking can also be done, but I wonder if it is actually useful (I mean, is there a font suitable for simplified Chinese out there that doesn't encode all the characters of gb2312? It would be like a font for English that is missing the letter "r").

"Big5" is a bit more problematic, as there is no such thing as a well defined "Big5" encoding. Rather, in the pure Microsoftian tradition (big5 comes after all from that side), there are a number of revisions, all with the same name, that each add some characters, so an older font can miss some chars that a newer one has (according to a newer definition of "big5").

But to handle such a case, I think it would be better to choose a given definition of "big5" (or several of them) and stick to it, rather than allowing so tremendously big a hole as 500 possible missing chars.

--
Ki ça vos våye bén,
Pablo Saratxaga
http://chanae.stben.be/pablo/
PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
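The idea of "exclusive" characters described above can be explored with a rough sketch. It assumes Python's built-in `gb2312`, `big5`, `shift_jis` and `euc_kr` codecs are acceptable stand-ins for the legacy repertoires being discussed (each codec's exact definition may differ from the tables a font tool would use); the helper names `repertoire` and `exclusive` are hypothetical.

```python
def repertoire(codec: str) -> set[str]:
    """Return the non-ASCII BMP characters that `codec` can encode."""
    chars = set()
    for cp in range(0x80, 0x10000):
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # not part of this legacy encoding
        chars.add(chr(cp))
    return chars


# Legacy encodings standing in for each language's repertoire (an assumption).
legacy = {name: repertoire(name)
          for name in ("gb2312", "big5", "shift_jis", "euc_kr")}


def exclusive(name: str) -> set[str]:
    """Characters found in `name` but in none of the other legacy encodings."""
    others = set()
    for other, chars in legacy.items():
        if other != name:
            others |= chars
    return legacy[name] - others


for name in legacy:
    print(name, len(legacy[name]), "total,", len(exclusive(name)), "exclusive")
```

A 256-glyph probe set for one language could then be drawn from `exclusive(name)`; which characters make a good probe is exactly the open question raised in the reply to this message, and the sketch does not answer it.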
Re: [Fonts]Automatic 'lang' determination
Around 14 o'clock on Jun 29, Yao Zhang wrote:

> Sure, I will install as many Chinese fonts as possible and get the
> fonts.cache for you. But before that, I will show you several lines in my
> fonts.cache:

I'm afraid the mailers corrupted the rather long lines in those files, but given that I've discovered that GB2312 is a relatively strong test for suitability for simplified Chinese, perhaps we can avoid sending this data at all.

> Now for lang, ZYSong18030 is labelled as
> lang=simplifiedchinese
> while SimSun-18030 is labelled as
> lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us

These language tags come from the OS/2 table and are set by the font designer. If, as our friend Jungshik Shin says, simplified forms were not unified with traditional forms in the BMP, then it's quite reasonable to build a font that can cover both languages.

With the new improved GB2312-based simplified test, I suspect the correct languages would be generated automatically from this font as well.

I've gone ahead and committed the changes necessary for automatic lang determination to XFree86 CVS; those interested in verifying its sensitivity and specificity are welcome to check it out and run:

    $ FC_DEBUG=256 fc-cache -f

This will display the number of missing glyphs in each language for each font and also display errors in the lang value relative to that specified in the TrueType file.

Keith Packard  XFree86 Core Team  HP Cambridge Research Lab
Re: [Fonts]Automatic 'lang' determination
Around 20 o'clock on Jun 29, Pablo Saratxaga wrote:

> A font is suited for a given language when it covers *ALL* of the codepoints
> needed for that language.

Yes, that's obviously true, but the problem is that I don't have tables for each language indicating the required codepoints; all I have are tables listing Unicode values in encodings traditionally used for each language. These tables almost always include a few (1-5) glyphs which many fonts are missing.

So, the test is to require that the number of missing glyphs for non-Han languages is very small (<8) to allow fonts which happen to be missing only a few unimportant glyphs to be used. Discovering which glyphs in each encoding are problematic in many fonts would allow this fudge factor to be reduced further.

> So, the tests for CJK languages and for other languages are clearly different,
> only CJK languages can go with testing only a "significant fraction",
> for all other languages all chars must be tested.

Yes, the tolerance value given for the Han languages is 500 codepoints while the value for non-Han languages is two orders of magnitude smaller.

Keith Packard  XFree86 Core Team  HP Cambridge Research Lab
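The tolerance test described in this message can be sketched roughly as follows. This is not the fontconfig code; the function names, the choice of Python, and the treatment of both limits as inclusive bounds are assumptions, and the sample fonts are invented for illustration.

```python
# Tolerances quoted in the thread; treating them as inclusive limits is an assumption.
HAN_TOLERANCE = 500
NON_HAN_TOLERANCE = 7  # "very small (<8)"


def missing_count(font_chars: set[int], required: set[int]) -> int:
    """Number of required codepoints the font has no glyph for."""
    return len(required - font_chars)


def is_suitable(font_chars: set[int], required: set[int], tolerance: int) -> bool:
    """A font passes when it misses no more than `tolerance` codepoints."""
    return missing_count(font_chars, required) <= tolerance


# Hypothetical repertoire: printable ASCII plus the upper half of Latin-1.
required = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))

font_a = required - {0xA4, 0xA6, 0xAC}        # misses 3 symbols
font_b = required - set(range(0xC0, 0xD0))    # misses 16 accented capitals

print(is_suitable(font_a, required, NON_HAN_TOLERANCE))  # True
print(is_suitable(font_b, required, NON_HAN_TOLERANCE))  # False
```

The sketch makes the objection in the thread concrete: the count alone cannot tell whether the missing codepoints are rare symbols or letters the language needs.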
Re: [Fonts]Automatic 'lang' determination
Around 13 o'clock on Jun 29, Yao Zhang wrote:

>     if (covers_much_of (gb18030))
>         font supports simplified Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese
>         font does not support simplified Chinese
>
> For a GB18030 font, since it covers much of GB18030 set, it supports
> simplified Chinese. And it also covers almost all of BIG5, so it
> supports traditional Chinese too. But now the algorithm excludes it
> from simplified Chinese support. The last line is wrong.

Yes, I think the problem is that I'm using GBK for the test instead of GB2312 -- I got the simplified coverage information from codepage 936, which is based on GBK. The fonts I have don't cover most of GBK, but do cover nearly all of GB2312.

>     if (covers_almost_all_of (GB2312))
>         font supports SIMPLIFIED Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese

Thanks, this works just fine. I'm much happier with this solution.

Keith Packard  XFree86 Core Team  HP Cambridge Research Lab
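Why a GBK-derived table rejects fonts that a GB2312 table accepts can be seen from the size of the two repertoires. The sketch below assumes Python's `gb2312` and `gbk` codecs approximate the tables in question, and uses an invented font whose coverage is exactly GB2312; the helper name `repertoire` is hypothetical.

```python
def repertoire(codec: str) -> set[str]:
    """Return the non-ASCII BMP characters that `codec` can encode."""
    chars = set()
    for cp in range(0x80, 0x10000):
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        chars.add(chr(cp))
    return chars


gb2312 = repertoire("gb2312")
gbk = repertoire("gbk")

# Hypothetical font: covers GB2312 and nothing more.
font = set(gb2312)

missing_gb2312 = len(gb2312 - font)
missing_gbk = len(gbk - font)

print("GB2312 size:", len(gb2312), "missing:", missing_gb2312)
print("GBK size:   ", len(gbk), "missing:", missing_gbk)
```

Against the GBK table such a font misses far more than the 500-codepoint tolerance, while against the GB2312 table it misses nothing, which matches the behaviour reported for the fonts at hand.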
Re: [Fonts]Automatic 'lang' determination
On Sat, 29 Jun 2002, Yao Zhang wrote:

> It should be
>
>     if (covers_almost_all_of (GB2312))
>         font supports SIMPLIFIED Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese

After sending my prev. message, I read this and I have to agree with this. This is better than what I sent earlier. Just forgetting about GB18030/GBK coverage and concentrating on GB2312 and Big5 coverage is simpler as well as better.

Jungshik Shin
[Fonts]Han unification(SC and TC)(was..Re: Automatic 'lang' determination)
On Sat, 29 Jun 2002, Keith Packard wrote:

Ooops. My message crossed yours in mail :-)

> Around 9 o'clock on Jun 29, Jungshik Shin wrote:
> > IMHO, most problems with Han Unification arise not from using a _single_
> > font targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in
> > another but from mixing _multiple_ fonts (with _drastically different_
> > design principle and other differences like baseline) to render a single
...
> Yes, I agree -- this is true in Western languages as well where the

We agree with each other on this point, but still get to different conclusions about zh-CN and zh-TW. I'm afraid that's because you have been misinformed about what Han unification has done about simplified forms and traditional forms of Chinese characters.

> > Suppose there's a document tagged as zh_TW that explains how PRC government
> > simplified Chinese characters to boost the literacy rate after WW II. If a
> > Big5 font (that doesn't cover all characters in the doc) is selected
> > instead of a GBK/GB18030 font (with the full coverage), simplified Han
> > characters(not used in Taiwan but only used in PRC) in the doc have to be
> > rendered with another font (most likely GB2312/GBK/GB18030 font).
>
> A correct version of this document would tag individual sections of the
> document with appropriate tags. This way, the zh_TW sections could be
> presented in a traditional Chinese font while the mainland portions are
> displayed with simplified Chinese glyphs.

Well, even without language tagging, that would happen, which I regard as _ugly_ for the reason I gave in my previous message. Language tag or not, the result would be just as ugly as using TimesRoman Latin-1 font for most characters with a couple of characters rendered with Palatino Latin-2 font.

My hypothetical document would not have separate sections for zh-TW and zh-CN, but rather occasional simplified forms of Chinese characters (absent in Big5 fonts but present in GB2312/GBK/GB18030 fonts) would pop up among traditional forms of Chinese characters (present in _both_ Big5 font and GBK/GB18030 fonts). IMHO, tagging the whole document as 'zh-TW' is perfectly valid and rendering it with GBK/GB18030 (with the full coverage of characters in the document) is better than mixing two fonts, one with Big5 coverage and the other with GBK/GB18030 coverage. The latter would happen if you exclude GBK/GB 18030 fonts for zh-TW text rendering. Tagging individual simplified forms of Chinese characters with 'lang=zh-CN' in the sea of traditional forms of Chinese characters would only lead to a less-desirable result than otherwise possible.

> > I'm not sure what you meant by 'glyph forms are more likely
> > simplified'. You might have misunderstood some aspects of Han Unification
> > in Unicode/10646. In Unicode, simplified forms of Chinese characters are
> > NOT unified with corresponding traditional forms of Chinese characters.
>
> You're right -- I didn't believe this to be the case. I had heard that the
> unified portion within the BMP do co-mingle simplified and traditional
> forms, but that the non-BMP Han extension provide separate codepoints for
> each.

I'm afraid what you have heard of BMP section is misleading if I understood you correctly. Whether in BMP or not, simplified forms of Chinese characters are NOT UNIFIED with traditional forms of Chinese characters. (let me copy my message to John H. Jenkins @Apple who knows a lot more about Han Unification than I do.) AFAIK, most complaints about Han unification do NOT come from zh-CN vs zh-TW BUT from zh-CN/zh-TW vs ja. For Han characters common in both zh-CN and zh-TW, there's no significant difference in appearance between zh-CN and zh-TW.

Although many Japanese would not agree with me, I don't think there's any significant difference across CJKV. (again, ISO 10646 Han chart is a good reference along with ROC MOE's Han character variant dictionary at http://140.111.1.40) To me, Han Unification should have gone further (not less) in a sense and it's worrisome to me that non-BMP includes too many glyph variants (a whole bunch of them coming from Korean Buddhist text : see http://www.sutra.re.kr) that should have been unified in my eyes.

> If even BMP codepoints are separate,
> then it should be possible to create
> a large set of codepoints which could mark fonts as suitable for the
> display of simplified Chinese which are distinct from the set of
> codepoints suitable for the display of traditional Chinese. That would
> be nicer than my current kludge of marking any font suitable for
> traditional chinese as unsuitable for simplified Chinese.

How about this?

    if covers most of GB 18030
        good for both zh-CN and zh-TW (and possibly good for ko)
    elif covers most of GBK
        good for both zh-CN and zh-TW (and possibly good for ko)
        not good for ja
    elif covers most of Big5,
        good for zh-TW (and possibly good for ko)
Re: [Fonts]Automatic 'lang' determination
Keith Packard wrote:

> Actually, I could really use as many Han fonts as you have, especially if
> they are from different vendors and of different ages. All I really need
> is the fonts.cache files generated from these fonts; that holds the unicode
> coverage and any OS/2 table information. That would be a lot smaller, and
> also avoid any copyright or trade secret problems.

Sure, I will install as many Chinese fonts as possible and get the fonts.cache for you. But before that, I will show you several lines in my fonts.cache:

"/usr/share/fonts/zh_CN/TrueType/zysong.ttf" 0 1017360509 "ZYSong18030:style=regular:slant=0:weight=100:index=0:outline=True:scalable=True:charset=:lang=simplifiedchinese"

"/usr/share/fonts/zh_CN/TrueType/SimSun18030.ttc" 0 1021954464 "SimSun\\-18030:style=regular:slant=0:weight=100:spacing=100:index=0:outline=True:scalable=True:charset= [encoded charset data, corrupted by the mailers and cut off in the archive]
Re: [Fonts]Automatic 'lang' determination
Kaixo!

On Sat, Jun 29, 2002 at 09:34:43AM -0700, Keith Packard wrote:

> This goal is reflected in the design I outlined -- fonts are deemed
> "suitable" for a particular language when they cover a significant
> fraction of the codepoints commonly associated with that language.

That is unacceptable. A font is suited for a given language when it covers *ALL* of the codepoints needed for that language.

The only exception to checking *all* of the needed codepoints is the CJK languages, because:

- there is a very small set of such languages
- the fonts are designed with coverage of one of them in mind
- the mandatory glyphs needed for a given CJK language that don't overlap with any other CJK language make quite a big set, which allows testing just a small, carefully chosen set of glyphs and assuming that all other glyphs needed for that CJK language are present too.

Maybe scripts used for one and only one language can also be handled without checking all the needed codepoints (but on the other hand they always form a small number of codepoints, so checking them all is not a problem).

But for the large majority of languages, which are not the only ones written with a given script, just checking coverage of a "significant fraction" is not enough. Take Spanish, for example: it needs the a-z letters plus áéíóúüñ (that is, aacute, eacute, iacute, oacute, uacute, udiaeresis and ntilde). If only one of these is missing then you cannot render a Spanish text correctly; even if the font covers 65 of the 66 chars (33 lowercase, 33 uppercase), it is still not suitable for properly rendering Spanish text (it may go unnoticed if the text just happens not to use the missing letter, but relying on chance is not very serious).

So the tests for CJK languages and for other languages are clearly different: only CJK languages can get by with testing only a "significant fraction"; for all other languages all chars must be tested.

> > Suppose there's a document tagged as zh_TW that explains how PRC government
> > simplified Chinese characters to boost the literacy rate after WW II. If a
> > Big5 font (that doesn't cover all characters in the doc) is selected
> > instead of a GBK/GB18030 font (with the full coverage), simplified Han
> > characters(not used in Taiwan but only used in PRC) in the doc have to be
> > rendered with another font (most likely GB2312/GBK/GB18030 font).
>
> A correct version of this document would tag individual sections of the
> document with appropriate tags. This way, the zh_TW sections could be
> presented in a traditional Chinese font while the mainland portions are
> displayed with simplified Chinese glyphs.

Indeed. I wonder, however, how place names are handled. Are there place names written with hanzi that don't exist in simplified form? If so, what would be the preferred way to write such a place name in a simplified Chinese text? Same question for people's names.

--
Ki ça vos våye bén,
Pablo Saratxaga
http://chanae.stben.be/pablo/
PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
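[Editor's sketch] The "all codepoints must be present" test argued for in the message above could look roughly like the following. This is only an illustration under stated assumptions: `FontCoverage`, `font_has`, `missing_count` and `supports_spanish` are hypothetical names, and a font's coverage is modelled as a sorted array of codepoints rather than whatever structure the real font library uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a font's Unicode coverage: a sorted array. */
typedef struct {
    const uint32_t *codepoints;  /* sorted ascending, no duplicates */
    size_t count;
} FontCoverage;

/* Binary search: returns 1 if the font has a glyph for cp, else 0. */
static int font_has(const FontCoverage *font, uint32_t cp)
{
    size_t lo = 0, hi = font->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (font->codepoints[mid] == cp)
            return 1;
        if (font->codepoints[mid] < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Number of required codepoints the font does not cover. */
static size_t missing_count(const FontCoverage *font,
                            const uint32_t *required, size_t n)
{
    size_t missing = 0;
    for (size_t i = 0; i < n; i++)
        if (!font_has(font, required[i]))
            missing++;
    return missing;
}

/* The seven accented letters named in the message, upper and lower case:
 * A/E/I/O/U acute, U diaeresis, N tilde (sorted ascending). */
static const uint32_t spanish_extra[] = {
    0x00C1, 0x00C9, 0x00CD, 0x00D1, 0x00D3, 0x00DA, 0x00DC,
    0x00E1, 0x00E9, 0x00ED, 0x00F1, 0x00F3, 0x00FA, 0x00FC
};
#define SPANISH_EXTRA_COUNT (sizeof spanish_extra / sizeof spanish_extra[0])

/* Strict test: every one of the 66 letters must be covered. */
int supports_spanish(const FontCoverage *font)
{
    for (uint32_t cp = 'A'; cp <= 'Z'; cp++)
        if (!font_has(font, cp) || !font_has(font, cp + 0x20))
            return 0;
    return missing_count(font, spanish_extra, SPANISH_EXTRA_COUNT) == 0;
}
```

Under this sketch a font covering 65 of the 66 letters is rejected, which is the behaviour the message asks for; a "significant fraction" rule would instead compare `missing_count` against some tolerance.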
Re: [Fonts]Automatic 'lang' determination
I wrote earlier:

> Actually, it is better changed to
>     if (covers_almost_all_of (GB2312))
>         font supports traditional Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese

It should be

    if (covers_almost_all_of (GB2312))
        font supports SIMPLIFIED Chinese
    if (covers_almost_all_of (Big5))
        font supports traditional Chinese

Sorry about the typo.
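[Editor's sketch] The corrected rule above treats the two tests as independent, so one font can be tagged for both languages. A minimal C rendering, assuming the caller already knows how many GB2312 and Big5 codepoints the font is missing: `HanLangs`, `classify_han` and the `max_missing` tolerance are hypothetical, and the thread does not say what "almost all" means numerically.

```c
#include <stddef.h>

/* Hypothetical result: which Chinese variants a font is tagged for. */
typedef struct {
    int simplified_chinese;
    int traditional_chinese;
} HanLangs;

/* "Almost all" is modelled as: at most max_missing codepoints absent.
 * The tolerance is an assumption, not a value taken from the thread. */
static int covers_almost_all_of(size_t missing, size_t max_missing)
{
    return missing <= max_missing;
}

/* The two tests are independent: passing the Big5 test does not
 * cancel a positive GB2312 result, unlike the earlier algorithm. */
HanLangs classify_han(size_t missing_gb2312, size_t missing_big5,
                      size_t max_missing)
{
    HanLangs langs;
    langs.simplified_chinese =
        covers_almost_all_of(missing_gb2312, max_missing);
    langs.traditional_chinese =
        covers_almost_all_of(missing_big5, max_missing);
    return langs;
}
```

With this shape, a GBK or GB18030 font that is missing nothing from either set comes out tagged for both simplified and traditional Chinese, while a GB2312-only font is tagged for simplified Chinese alone.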
Re: [Fonts]Automatic 'lang' determination
From: Keith Packard <[EMAIL PROTECTED]>
> Around 22 o'clock on Jun 29, Yu Shao wrote:
>
> > >Tagging GB18030 fonts as suitable for traditional chinese seems like a
> > >mistake; the glyph forms are more likely simplified, and it would be
> > >
> > Agreed.
>
> This is reassuring.

No, this is not the case. Let us use Unicode terms here, because those national standards are misleading. GB18030 is a PRC standard, but that doesn't mean it is for simplified Chinese. Actually, all those fonts use a Unicode CMAP, so they are really Unicode fonts.

For Han characters, GB18030 covers CJK Unified Ideographs and its Extension A. GBK covers CJK Unified Ideographs only. Roughly speaking, CJK Unified Ideographs covers both the GB2312 and BIG5 character sets. The simplified and traditional forms are NOT unified. So both GBK and GB18030 fonts are suitable for simplified Chinese and traditional Chinese.

No, the algorithm is not quite right:

    if (covers_much_of (gb18030))
        font supports simplified Chinese
    if (covers_almost_all_of (Big5))
        font supports traditional Chinese
        font does not support simplified Chinese

A GB18030 font covers much of the GB18030 set, so it supports simplified Chinese. It also covers almost all of BIG5, so it supports traditional Chinese too. But the algorithm as written then excludes it from simplified Chinese support. The last line is wrong.

Actually, it is better changed to

    if (covers_almost_all_of (GB2312))
        font supports traditional Chinese
    if (covers_almost_all_of (Big5))
        font supports traditional Chinese

Regards,
Yao Zhang
Re: [Fonts]Automatic 'lang' determination
On Sat, 29 Jun 2002, Jungshik Shin wrote:

> On Fri, 28 Jun 2002, Keith Packard wrote:
> > I'm confused by this; my exposure to Chinese fonts says that simplified
> > Chinese and traditional Chinese have significant overlap in Unicode
> > codepoints, but that the glyphs are quite a bit different in appearance.
>
> I doubt this is the case. As far as I can tell

I found this needs some clarification. If the glyphs for 'A', 'B' and 'C' from a Times Roman Latin-1 font are compared with the corresponding glyphs from a New Century Schoolbook Latin-2 font, they certainly look different. However, that does not mean that you cannot use the Times Roman Latin-1 font to render a run of text in one of the languages Latin-2 is meant for, as long as the Times Roman Latin-1 font has _all_ the glyphs necessary in that particular run of text.

I believe the same thing can happen between two fonts for zh-TW and zh-CN. If glyphs from font A for zh-TW are compared with glyphs from font B (with different design principles) for zh-CN, they for sure look different. However, they're different not because font A is for zh-TW and font B is for zh-CN but because they're designed to appear different.

> > Chinese and traditional Chinese have significant overlap in Unicode
> > codepoints, but that the glyphs are quite a bit different in appearance.

To make this kind of comparison meaningful, you have to compare two fonts, one for zh-TW and the other for zh-CN, made by a _single_ foundry with _identical_ design principles and look and feel (something like the Adobe Times Roman Latin-1 font and the Adobe Times Roman Latin-2 font). In practice, it's hard to find two fonts that satisfy the criteria I outlined here. However, the ISO 10646 code charts for Han characters should do almost as good a job. That's why I suggested comparing glyphs for PRC and Taiwan in the ISO 10646 Han character chart.

Jungshik Shin
Re: [Fonts]Automatic 'lang' determination
Around 9 o'clock on Jun 29, Jungshik Shin wrote:

> IMHO, most problems with Han Unification arise not from using a _single_
> font targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in
> another but from mixing _multiple_ fonts (with _drastically different_
> design principle and other differences like baseline) to render a single
> run of text (say, 65% of characters drawn from one font, 25% from a second
> font, 7% from a third font, etc).

Yes, I agree -- this is true in Western languages as well, where the application selects a font covering only Latin-1 and attempts to display text requiring glyphs from Latin-2; a "smart" application will locate an additional font to fill in the missing glyphs, and the result looks like a ransom note.

The hope is that proper language tags in the document can avoid this at the start by making the first font contain the proper coverage for the entire block of text.

This goal is reflected in the design I outlined -- fonts are deemed "suitable" for a particular language when they cover a significant fraction of the codepoints commonly associated with that language.

> Suppose there's a document tagged as zh_TW that explains how PRC government
> simplified Chinese characters to boost the literacy rate after WW II. If a
> Big5 font (that doesn't cover all characters in the doc) is selected
> instead of a GBK/GB18030 font (with the full coverage), simplified Han
> characters(not used in Taiwan but only used in PRC) in the doc have to be
> rendered with another font (most likely GB2312/GBK/GB18030 font).

A correct version of this document would tag individual sections of the document with appropriate tags. This way, the zh_TW sections could be presented in a traditional Chinese font while the mainland portions are displayed with simplified Chinese glyphs.

I don't know how prevalent language tagging is in office document formats, but it's certainly available in HTML. It's the HTML case that started my journey into language tags.

> I'm not sure what you meant by 'glyph forms are more likely
> simplified'. You might have misunderstood some aspects of Han Unification
> in Unicode/10646. In Unicode, simplified forms of Chinese characters are
> NOT unified with corresponding traditional forms of Chinese characters.

You're right -- I didn't believe this to be the case. I had heard that the unified portion within the BMP does co-mingle simplified and traditional forms, but that the non-BMP Han extensions provide separate codepoints for each.

If even BMP codepoints are separate, then it should be possible to create a large set of codepoints which could mark fonts as suitable for the display of simplified Chinese and which is distinct from the set of codepoints suitable for the display of traditional Chinese.

That would be nicer than my current kludge of marking any font suitable for traditional Chinese as unsuitable for simplified Chinese.

Keith Packard
XFree86 Core Team
HP Cambridge Research Lab
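[Editor's sketch] One way to obtain the "distinct set of codepoints" mentioned above is to take the codepoints mapped from one charset that do not occur in the other. The helper below is a plain sorted-array set difference; `sorted_difference` is a hypothetical name, and the real GB2312 and Big5 mapping tables are assumed to be available to the caller as sorted arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes to out every element of a that does not occur in b and
 * returns how many were written. Both inputs must be sorted ascending
 * without duplicates; out must have room for na elements. */
size_t sorted_difference(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb,
                         uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na) {
        if (j == nb || a[i] < b[j]) {
            out[n++] = a[i++];       /* only in a: a distinguishing codepoint */
        } else if (a[i] == b[j]) {
            i++;                     /* shared by both sets: skip */
            j++;
        } else {
            j++;                     /* only in b: not relevant here */
        }
    }
    return n;
}
```

As a toy example (three hand-picked codepoints per side, not the real tables), feeding it {U+4EBA, U+56FD, U+5B66} as the simplified side and {U+4EBA, U+570B, U+5B78} as the traditional side leaves U+56FD and U+5B66, the two simplified-only forms. A font could then be tested against that reduced set instead of the whole encoding.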
Re: [Fonts]Automatic 'lang' determination
Around 22 o'clock on Jun 29, Yu Shao wrote:

> >Tagging GB18030 fonts as suitable for traditional chinese seems like a
> >mistake; the glyph forms are more likely simplified, and it would be
> >
> Agreed.

This is reassuring.

> As gb18030 is compulsory from government, I think we should just treat
> gb18030 as Simplified Chinese, and all fonts from now on should be gb18030
> compliant. For these fonts, the new included Chinese minority Yi and
> Tibetan characters would do.

The trick that I use to distinguish between simplified Chinese and traditional Chinese targeted fonts is not whether they cover a significant fraction of the Unicode codepoints mapped from gb18030, but whether they cover nearly all of the Unicode codepoints mapped from Big5. The algorithm looks like:

    if (covers_much_of (gb18030))
        font supports simplified Chinese
    if (covers_almost_all_of (Big5))
        font supports traditional Chinese
        font does not support simplified Chinese
    if (covers_almost_all_of (JIS))
        font supports Japanese
        font does not support simplified Chinese
    if (covers_almost_all_of (Korean Wansung))
        font supports Korean
        font does not support simplified Chinese

Nearly all Han fonts cover as much of GB18030 as those targeted for simplified Chinese, but (in my limited sample) simplified Chinese fonts cover only a small fraction of all of the other Han encodings. The exception is Arial Unicode, which covers all of the encodings nearly completely.

Remember that this whole mess is only needed for fonts which don't have any OS/2 codePageRange bits set; the hope is that new fonts covering more of the Unicode range will be provided in TrueType or OpenType format so that this particular hack can be avoided.

> But the very popular Microsoft's Chinese simsun font now, is actually a
> gbk font.

This is a TrueType font and so the above hacks don't apply. Are there new GB18030 fonts being distributed in formats that don't include the OS/2 codePageRange bits?

Keith Packard
XFree86 Core Team
HP Cambridge Research Lab
Re: [Fonts]Automatic 'lang' determination
On Fri, 28 Jun 2002, Keith Packard wrote:

>
> Around 0 o'clock on Jun 29, Yao Zhang wrote:
>
> > A GB18030 font (covers CJK Unified Ideographs and its extension A in Unicode
> > terms) should really be labeled as
> > Simplified Chinese AND Traditional Chinese
> > while fonts with GB2312 coverage should be labeled as
> > Simplified Chinese
> > and BIG5 coverage should be labeled as
> > Traditional Chinese
>
> I'm confused by this; my exposure to Chinese fonts says that simplified
> Chinese and traditional Chinese have significant overlap in Unicode
> codepoints, but that the glyphs are quite a bit different in appearance.

I doubt this is the case. As far as I can tell from ISO 10646 (unlike Unicode, for a single Han character, ISO 10646 lists glyphs as _commonly_ used in PRC, Taiwan, Japan, ROK, and Vietnam. ISO 10646:2 also lists DPRK glyphs), characters common to GB2312 (SC) and Big5 (TC) do not differ enough in their glyphs (if there's any difference at all) to make using a _single_ font (say, GB18030/GBK fonts) for both zh_CN and zh_TW undesirable.

IMHO, most problems with Han Unification arise not from using a _single_ font targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in another but from mixing _multiple_ fonts (with _drastically different_ design principles and other differences like baseline) to render a single run of text (say, 65% of characters drawn from one font, 25% from a second font, 7% from a third font, etc). I'm not saying there's no problem at all in using a TC font for Japanese text rendering. I'm well aware that many Japanese don't like that. However, using GBK/GB18030 fonts for TC should present much, much less of a problem than that.

> I'm not interested in discovering which fonts can display a particular
> document; that's easily done with Unicode coverage. What I'm interested in
> is selecting the font best suited for presenting data tagged for a
> particular language.

I believe Yao's well aware of your interest here. What he meant is that using GBK/GB18030 fonts for both SC and TC rendering is all right. It could even be desirable in some cases.

Suppose there's a document tagged as zh_TW that explains how the PRC government simplified Chinese characters to boost the literacy rate after WW II. If a Big5 font (that doesn't cover all characters in the doc) is selected instead of a GBK/GB18030 font (with full coverage), the simplified Han characters in the doc (not used in Taiwan but only in the PRC) have to be rendered with another font (most likely a GB2312/GBK/GB18030 font). Even if the font selection routine does a pretty good job of picking two fonts (a Big5 font and a GB2312/GBK/GB18030 font) with a similar look and feel, there may be a subtle but noticeable difference between the two. If a GBK/GB18030 font is used to render _all_ Han characters in the doc, this wouldn't be an issue and the result would have a uniform and consistent look and feel.

> Tagging GB18030 fonts as suitable for traditional chinese seems like a
> mistake; the glyph forms are more likely simplified, and it would be
> preferable to use a traditional chinese font, if any is available. Of

I'm not sure what you meant by 'glyph forms are more likely simplified'. You might have misunderstood some aspects of Han Unification in Unicode/10646. In Unicode, simplified forms of Chinese characters are NOT unified with the corresponding traditional forms of Chinese characters. If GB2312 and Big5 have some characters in common, that's because the PRC didn't simplify them and just decided to use the traditional forms.

Jungshik Shin
Re: [Fonts]Automatic 'lang' determination
Keith Packard wrote:

>Around 0 o'clock on Jun 29, Yao Zhang wrote:
>
>>A GB18030 font (covers CJK Unified Ideographs and its extension A in Unicode
>>terms) should really be labeled as
>>Simplified Chinese AND Traditional Chinese
>>while fonts with GB2312 coverage should be labeled as
>>Simplified Chinese
>>and BIG5 coverage should be labeled as
>>Traditional Chinese
>>
>
>I'm confused by this; my exposure to Chinese fonts says that simplified
>Chinese and traditional Chinese have significant overlap in Unicode
>codepoints, but that the glyphs are quite a bit different in appearance.
>
>I'm not interested in discovering which fonts can display a particular
>document; that's easily done with Unicode coverage. What I'm interested in
>is selecting the font best suited for presenting data tagged for a
>particular language.
>
>Tagging GB18030 fonts as suitable for traditional chinese seems like a
>mistake; the glyph forms are more likely simplified, and it would be
>

Agreed.

>
>preferable to use a traditional chinese font, if any is available. Of
>course, when no traditional chinese font is present, the system will
>search for *any* font which does cover those codepoints, substituting in
>an available simplified chinese font.
>
>I believe I've found a relatively robust way of distinguishing fonts
>designed for traditional chinese from those designed for simplified
>chinese; traditional chinese fonts cover most of BIG5 while simplified
>chinese fonts don't. Both cover similar amounts of GB18030; as you say,
>that encoding is enormous.
>
>What I didn't investigate is whether the simplified chinese fonts cover
>*different* parts of GB18030 than the traditional fonts. That might make
>

As gb18030 is compulsory from government, I think we should just treat gb18030 as Simplified Chinese, and all fonts from now on should be gb18030 compliant. For these fonts, the newly included Chinese minority Yi and Tibetan characters would do.

But the very popular Microsoft Chinese simsun font is, for now, actually a gbk font.

>
>the determination easier; simply use the subset of GB18030 normally needed
>to present simplified chinese documents as the touchstone instead of the
>whole encoding. For that to work, I'd need a lot more simplified chinese
>fonts from various vendors.
>
>>If you need those fonts for testing, I will send you one typical font
>>in each category (They are huge, at least several MB in size). For
>>example,
>>
>
>Actually, I could really use as many Han fonts as you have, especially if
>they are from different vendors and of different ages. All I really need
>is the fonts.cache files generated from these fonts; that holds the unicode
>coverage and any OS/2 table information. That would be a lot smaller, and
>also avoid any copyright or trade secret problems.
>
>Keith Packard
>XFree86 Core Team
>HP Cambridge Research Lab
Re: [Fonts]FreeType 2 backend for the masses
JP> [...] either the instructions or the sources should be adapted.

I was only checking if anyone's paying attention ;-)

Nice to see you back, Joerg.

Juliusz
[Fonts]i18n fixes in Xlib, A.1141 improvement
Hello,

The attached patch improves my submission to <[EMAIL PROTECTED]> with sequence number A.1141 (and subject:)

The new code vs A.1141 adds one line of code plus some comments which answer a question from O. Taylor about adding a "break" in the code. I explain why this "break" should be added (moreover, it fixes a memory leak and fixes slow font loading with a base font name with a long list of fonts).

I cc this mail to [EMAIL PROTECTED] as I think that some parts of the Xlib i18n code need a real clean-up (the patch contains only fixes). I would like to know if someone maintains or works on this part of the Xlib code and whether I should start this clean-up. Probably another mailing list should be used, but as I am not in the xfree86 team (this is the first fix I have sent) I cc to this mailing list.

Here is a full change log:

* xc/lib/X11/omGeneric.c (destroy_fontdata): Free a XFontStruct which should have been freed but was not
* xc/lib/X11/omGeneric.c (parse_vw): (parse_fontname): Fixed minor memory leaks
* xc/lib/X11/omGeneric.c (parse_fontname): break when a match is found
* xc/lib/X11/lcFile.c (_XlcLocaleDirName): Fixed minor memory leaks

Regards, Olivier

PS:
- patch done in xc/lib/X11 with cvs diff -u
- Do not be afraid my C is better than my English
- Please apply the patch it fixes quite dramatic bugs IMHO

Index: lcFile.c
===================================================================
RCS file: /cvs/xc/lib/X11/lcFile.c,v
retrieving revision 3.26
diff -u -r3.26 lcFile.c
--- lcFile.c	2002/05/31 18:45:42	3.26
+++ lcFile.c	2002/06/29 07:55:43
@@ -421,12 +421,18 @@
             sprintf(buf, "%s/locale.dir", target_dir);
             target_name = resolve_name(name, buf, RtoL);
         }
+        if (name != NULL && name != lc_name) {
+            XFree(name);
+            name = NULL;
+        }
         if (target_name != NULL) {
             char *p = 0;
             if ((p = strstr(target_name, "/XLC_LOCALE"))) {
                 *p = '\0';
                 break;
             }
+            XFree(target_name);
+            target_name = NULL;
         }
     }
     if (target_name == NULL) {
@@ -437,5 +443,8 @@
     strcpy(dir_name, target_dir);
     strcat(dir_name, "/");
     strcat(dir_name, target_name);
+    if (target_name != lc_name) {
+        XFree(target_name);
+    }
     return dir_name;
 }
Index: omGeneric.c
===================================================================
RCS file: /cvs/xc/lib/X11/omGeneric.c,v
retrieving revision 3.20
diff -u -r3.20 omGeneric.c
--- omGeneric.c	2001/04/05 17:42:26	3.20
+++ omGeneric.c	2002/06/29 07:55:48
@@ -1056,6 +1056,22 @@
                  *
                  * Owen Taylor <[EMAIL PROTECTED]> 12 Jul 2000
                  */
+                /* The reason why this routine modifies font_data and has a
+                 * font_data_return is that if it is called with C_PRIMARY, then
+                 * font_data_return is used by the caller and with the others classes
+                 * font_data is used by the caller (font_data can be different
+                 * than font_data_return if we do not break here).
+                 * However, a close look at the code (e.g., the drawing funcs) shows
+                 * that breaking or not here change nothing!!
+                 * So we should 'break' here and the code needs a clean-up (e.g.,
+                 * some FontStruct are loaded and _never_ used).
+                 * Hopefully this also fix a memory leak: if we do not break here
+                 * a found a match later font_data->xlfd_name is deferenced without
+                 * being freed. Finally, this speed up font loading.
+                 *
+                 * <[EMAIL PROTECTED]> 2002-06-29
+                 */
+                break;
             }
 
             switch(class) {
@@ -1126,13 +1142,21 @@
     int ret = 0, i = 0;
 
     if(vmap_num > 0) {
-        if(parse_fontdata(oc, font_set, vmap, vmap_num, name_list, count, C_VMAP) == -1)
+        if(parse_fontdata(oc, font_set, vmap, vmap_num, name_list, count,
+                          C_VMAP, &font_data_return) == -1) {
+            if(font_data_return.xlfd_name != NULL)
+                XFree(font_data_return.xlfd_name);
             return (-1);
+        }
+        if(font_data_return.xlfd_name != NULL)
+            XFree(font_data_return.xlfd_name);
     }
     if(vrotate_num > 0) {
         ret = parse_fontdata(oc, font_set, (FontData) vrotate,
                              vrotate_num, name_list, count,
                              C_VROTATE, &font_data_return);
+        if(font_data_return.xlfd_name != NULL)
+            XFree(font_data_return.xlfd_name);
         if(ret == -1) {
             return (-1);
         } else if(ret == False) {
@@ -1168,6 +1192,8 @@
             ret = parse_fontdata(oc, font_set, (FontData) vrotate,
                                  vrotate_num, name_list, count,
                                  C_VROTATE, &font_data_return);
+            if(font_data_return.xlfd_name != NULL)
+                XFree(font_data_return.xlfd_name);
             if(ret == -1)
                 return (-1);
         }
@@ -1237,6 +1263,7 @@
     font_set->side = font_data_return.side;
     Xfree (font_data_