Re: [Fonts]Automatic 'lang' determination
Kaixo!

On Sat, Jun 29, 2002 at 05:17:04PM -0700, Keith Packard wrote:

> > What are those glyphs? (I'm quite surprised, I would have expected
> > the opposite: fonts generally have more glyphs than the standard
> > encodings of the iso-8859 family, for example.)
>
> My definition of language tag is coloured by the OS/2 table
> codePageRange bits from which it was originally defined in fontconfig.
> Those bits are defined to map to specific Windows code pages; the
> Latin-1 case doesn't map to ISO 8859-1, but rather to code page 1252,
> for which many fonts are missing a few random entries.

But what characters are those? Is it possible that they are the ones
that were added to cp1252 later and that didn't exist some years ago?

I think the matching should be done against the lowest common
denominator and be strict; or else different weights should be given to
missing *letters* and missing symbols (it may be more or less acceptable
to get quotation marks from another font; bUt lEttErs frOm A dIffErEnt
fOnts Is vErY UglY).

> > No, the tolerance for missing glyphs in CJK tests should be the same
> > or even smaller. The difference is that there is no need to test all
> > the glyphs for CJK coverage; testing only a set of 256 chosen glyphs
> > would be enough (if they are correctly chosen, testing that 256
> > glyphs are present in a font is enough to assure, with 99.99%
> > confidence, that it covers a given CJK language).
>
> I'm not confident enough of this approach; I fear that any set of 256
> glyphs that must appear in a simplified Chinese font may well appear
> in many traditional Chinese (or even Japanese) fonts. Most do, of
> course, but there are a lot that don't.

I have only dealt with some 10-15 TTF CJK fonts, but I never had false
positives using that method.

> > [...] out there that doesn't encode all the characters of gb2312?
>
> It seems that this must be the case -- I set the '500' number so high
> because all of the fonts which I have that advertise support for
> simplified Chinese are missing over 200 glyphs from GB2312. I got
> similar results for Japanese fonts, Korean Wansung fonts and
> traditional Chinese fonts.

But which characters are missing? Could it be that they are semi-graphic
ones, or scripts used by other languages (e.g. Cyrillic, Greek, Japanese
kana in a Chinese font, etc.)? Here too, different weights should be
used. It is not a big problem if a CJK font is missing Cyrillic, since a
font designed for Russian will be a much better choice to render
Cyrillic anyway; but it may be a big problem if some needed characters
are missing.

And I'm really surprised by a number as high as 200. Are you sure you
tested against gb2312 and not against the Microsoft codepage based on it
(which surely adds several extra characters)?

> > But to handle such a case, I think it would be better to choose a
> > given definition of big5 (or several of them) and stick to it,
> > rather than allowing such a tremendously big hole as 500 possible
> > missing chars.
>
> Missing 500 from a repertoire of nearly 2 doesn't seem to render most
> of these fonts unusable.

It could; it depends on which glyphs are missing.

--
Ki ça vos våye bén,
Pablo Saratxaga
http://chanae.stben.be/pablo/
PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
___
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts
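The two ideas in the message above, weighting a missing letter more heavily than a missing symbol and asking which scripts the missing characters belong to, could be sketched in Python roughly as follows. The function names and the weight values are assumptions made for illustration; nothing here is taken from fontconfig.

```python
import unicodedata
from collections import Counter

# Hypothetical weights: a missing letter hurts more than a missing symbol.
LETTER_WEIGHT = 10
OTHER_WEIGHT = 1


def missing_penalty(required, coverage):
    """Sum a weighted penalty over required characters absent from coverage."""
    penalty = 0
    for ch in set(required) - set(coverage):
        # Unicode general categories starting with "L" are letters.
        is_letter = unicodedata.category(ch).startswith("L")
        penalty += LETTER_WEIGHT if is_letter else OTHER_WEIGHT
    return penalty


def missing_by_script(required, coverage):
    """Count missing characters, grouped by the first word of their Unicode name.

    The first word (CYRILLIC, GREEK, HIRAGANA, CJK, ...) is only a rough
    stand-in for the script.
    """
    groups = Counter()
    for ch in set(required) - set(coverage):
        name = unicodedata.name(ch, "UNKNOWN")
        groups[name.split()[0]] += 1
    return groups
```

Run against a real font's coverage, a breakdown like this would show whether 200 missing GB2312 characters are mostly Cyrillic, Greek or kana, or whether hanzi are missing.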
Re: [Fonts]Automatic 'lang' determination
Pablo Saratxaga wrote:

> Kaixo!
>
> On Sat, Jun 29, 2002 at 05:17:04PM -0700, Keith Packard wrote:
>
> > > What are those glyphs? (I'm quite surprised, I would have expected
> > > the opposite: fonts generally have more glyphs than the standard
> > > encodings of the iso-8859 family, for example.)
> >
> > My definition of language tag is coloured by the OS/2 table
> > codePageRange bits from which it was originally defined in
> > fontconfig. Those bits are defined to map to specific Windows code
> > pages; the Latin-1 case doesn't map to ISO 8859-1, but rather to
> > code page 1252, for which many fonts are missing a few random
> > entries.
>
> But what characters are those? Is it possible that they are the ones
> that were added to cp1252 later and that didn't exist some years ago?
>
> I think the matching should be done against the lowest common
> denominator and be strict; or else different weights should be given
> to missing *letters* and missing symbols (it may be more or less
> acceptable to get quotation marks from another font; bUt lEttErs frOm
> A dIffErEnt fOnts Is vErY UglY).
>
> > > No, the tolerance for missing glyphs in CJK tests should be the
> > > same or even smaller. The difference is that there is no need to
> > > test all the glyphs for CJK coverage; testing only a set of 256
> > > chosen glyphs would be enough (if they are correctly chosen,
> > > testing that 256 glyphs are present in a font is enough to assure,
> > > with 99.99% confidence, that it covers a given CJK language).
> >
> > I'm not confident enough of this approach; I fear that any set of
> > 256 glyphs that must appear in a simplified Chinese font may well
> > appear in many traditional Chinese (or even Japanese) fonts. Most
> > do, of course, but there are a lot that don't.
>
> I have only dealt with some 10-15 TTF CJK fonts, but I never had false
> positives using that method.
>
> > > [...] out there that doesn't encode all the characters of gb2312?
> >
> > It seems that this must be the case -- I set the '500' number so
> > high because all of the fonts which I have that advertise support
> > for simplified Chinese are missing over 200 glyphs from GB2312. I
> > got similar results for Japanese fonts, Korean Wansung fonts and
> > traditional Chinese fonts.
>
> But which characters are missing? Could it be that they are
> semi-graphic ones, or scripts used by other languages (e.g. Cyrillic,
> Greek, Japanese kana in a Chinese font, etc.)? Here too, different
> weights should be used. It is not a big problem if a CJK font is
> missing Cyrillic, since a font designed for Russian will be a much
> better choice to render Cyrillic anyway; but it may be a big problem
> if some needed characters are missing.
>
> And I'm really surprised by a number as high as 200. Are you sure you
> tested against gb2312 and not against the Microsoft codepage based on
> it (which surely adds several extra characters)?

Hi Keith,

Checking against fontenc, both AR PL SungtiL GB and AR PL KaitiM GB
provide all of GB2312's 7445 characters, which comprise 6763 hanzi and
682 symbols. The 204 missing glyphs that fc-cache reports do not seem
correct?

Regards,

> > > But to handle such a case, I think it would be better to choose a
> > > given definition of big5 (or several of them) and stick to it,
> > > rather than allowing such a tremendously big hole as 500 possible
> > > missing chars.
> >
> > Missing 500 from a repertoire of nearly 2 doesn't seem to render
> > most of these fonts unusable.
>
> It could; it depends on which glyphs are missing.

--
Yu Shao
Red Hat Asia-Pacific
+61 7 3872 4835
Legal: http://apac.redhat.com/disclaimer
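A check like the one described above (does a font provide every GB2312 character?) can be approximated without fontenc. The sketch below uses Python's built-in `gb2312` codec as the reference table, which is an assumption: its mapping may differ in detail from the tables that fontenc or fontconfig use. The helper names are hypothetical.

```python
def gb2312_repertoire():
    """Return the characters Python's 'gb2312' codec can decode from the
    94x94 double-byte grid (lead and trail bytes 0xA1..0xFE)."""
    chars = set()
    for lead in range(0xA1, 0xFF):
        for trail in range(0xA1, 0xFF):
            try:
                chars.add(bytes([lead, trail]).decode("gb2312"))
            except UnicodeDecodeError:
                pass  # unassigned cell in the grid
    return chars


def count_missing(coverage):
    """Number of GB2312 characters absent from a font's coverage set."""
    return len(gb2312_repertoire() - set(coverage))
```

Here `coverage` stands for the set of characters a font maps, however that is obtained. A font that really provides the full repertoire would give `count_missing(coverage) == 0`, which is one way to cross-check a figure such as the 204 reported by fc-cache.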
Re: [Fonts]Automatic 'lang' determination
Around 9 o'clock on Jun 29, Jungshik Shin wrote:

> IMHO, most problems with Han Unification arise not from using a
> _single_ font targeted at one of zh_TW/zh_CN/ja/ko to render a run of
> text in another but from mixing _multiple_ fonts (with _drastically
> different_ design principles and other differences like baseline) to
> render a single run of text (say, 65% of characters drawn from one
> font, 25% from a second font, 7% from a third font, etc).

Yes, I agree -- this is true in Western languages as well, where the
application selects a font covering only Latin-1 and attempts to display
text requiring glyphs from Latin-2; a smart application will locate an
additional font to fill in the missing glyphs, and the result looks like
a ransom note. The hope is that proper language tags in the document can
avoid this at the start by making the first font contain the proper
coverage for the entire block of text.

This goal is reflected in the design I outlined -- fonts are deemed
suitable for a particular language when they cover a significant
fraction of the codepoints commonly associated with that language.

> Suppose there's a document tagged as zh_TW that explains how the PRC
> government simplified Chinese characters to boost the literacy rate
> after WW II. If a Big5 font (that doesn't cover all characters in the
> doc) is selected instead of a GBK/GB18030 font (with the full
> coverage), simplified Han characters (not used in Taiwan but only used
> in PRC) in the doc have to be rendered with another font (most likely
> a GB2312/GBK/GB18030 font).

A correct version of this document would tag individual sections of the
document with appropriate tags. This way, the zh_TW sections could be
presented in a traditional Chinese font while the mainland portions are
displayed with simplified Chinese glyphs.

I don't know how prevalent language tagging is in office document
formats, but it's certainly available in HTML. It's the HTML case that
started my journey into language tags.

> I'm not sure what you meant by 'glyph forms are more likely
> simplified'. You might have misunderstood some aspects of Han
> Unification in Unicode/10646. In Unicode, simplified forms of Chinese
> characters are NOT unified with corresponding traditional forms of
> Chinese characters.

You're right -- I didn't believe this to be the case. I had heard that
the unified portion within the BMP does co-mingle simplified and
traditional forms, but that the non-BMP Han extensions provide separate
codepoints for each. If even BMP codepoints are separate, then it should
be possible to create a large set of codepoints which could mark fonts
as suitable for the display of simplified Chinese and which is distinct
from the set of codepoints suitable for the display of traditional
Chinese.

That would be nicer than my current kludge of marking any font suitable
for traditional Chinese as unsuitable for simplified Chinese.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab
Re: [Fonts]Automatic 'lang' determination
On Sat, 29 Jun 2002, Jungshik Shin wrote:

> On Fri, 28 Jun 2002, Keith Packard wrote:
>
> > I'm confused by this; my exposure to Chinese fonts says that
> > simplified Chinese and traditional Chinese have significant overlap
> > in Unicode codepoints, but that the glyphs are quite a bit different
> > in appearance.
>
> I doubt this is the case. As far as I can tell

I found this needs some clarification. If glyphs of 'A', 'B' and 'C'
from a Times Roman Latin-1 font are compared with the corresponding
glyphs from a New Century Schoolbook Latin-2 font, they certainly look
different. However, that does not mean that you cannot use the Times
Roman Latin-1 font to render a run of text in one of the languages
Latin-2 is meant for, as long as the Times Roman Latin-1 font has _all_
the glyphs necessary in that particular run of text.

I believe the same thing can happen between two fonts for zh-TW and
zh-CN. If glyphs from font A for zh-TW are compared with glyphs from
font B (with different design principles) for zh-CN, they for sure look
different. However, they're different not because font A is for zh-TW
and font B is for zh-CN but because they're designed to appear
different.

> > [...] Chinese and traditional Chinese have significant overlap in
> > Unicode codepoints, but that the glyphs are quite a bit different in
> > appearance.

To make this kind of comparison meaningful, you have to compare two
fonts, one for zh-TW and the other for zh-CN, made by a _single_ foundry
with _identical_ design principles and look and feel (something like an
Adobe Times Roman Latin-1 font and an Adobe Times Roman Latin-2 font).
In practice, it's hard to find two fonts that satisfy the criteria I
outlined here. However, the ISO 10646 code charts for Han characters
should do almost as good a job. That's why I suggested comparing glyphs
for PRC and Taiwan in the ISO 10646 Han character chart.

Jungshik Shin
Re: [Fonts]Automatic 'lang' determination
I wrote earlier:

> Actually, it is better changed to
>
>     if (covers_almost_all_of (GB2312))
>         font supports traditional Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese

It should be

    if (covers_almost_all_of (GB2312))
        font supports SIMPLIFIED Chinese
    if (covers_almost_all_of (Big5))
        font supports traditional Chinese

Sorry about the typo.
Re: [Fonts]Automatic 'lang' determination
Kaixo!

On Sat, Jun 29, 2002 at 09:34:43AM -0700, Keith Packard wrote:

> This goal is reflected in the design I outlined -- fonts are deemed
> suitable for a particular language when they cover a significant
> fraction of the codepoints commonly associated with that language.

That is unacceptable. A font is suited for a given language when it
covers *ALL* of the codepoints needed for that language.

The only exception to checking *all* of the needed codepoints is the
CJK languages, because:

- there is a very small set of such languages
- the fonts are designed with coverage of one of them in mind
- the mandatory glyphs needed for a given CJK language that don't
  overlap with any other CJK language make quite a big set, which allows
  testing just a small, carefully chosen set of glyphs and assuming that
  all the other glyphs needed for that CJK language are present too.

Maybe scripts used for one and only one language can also be handled
without checking all the needed codepoints (but on the other hand they
always form a small number of codepoints, so checking them all is not a
problem).

But for the big majority of languages, which are not the only ones
written with a given script, just checking coverage of a significant
fraction is not enough.

Take Spanish, for example: it needs the a-z letters plus áéíóúüñ (that
is, aacute, eacute, iacute, oacute, uacute, udieresis and ntilde). If
only one of these is missing then you cannot render a Spanish text
correctly. Even if the font covers 65 of the 66 chars (33 lowercase, 33
uppercase), it is still not suitable to properly render Spanish text (it
may go unnoticed if the text just happens not to use the missing letter,
but relying on chance is not very serious).

So, the tests for CJK languages and for other languages are clearly
different: only CJK languages can get by with testing only a significant
fraction; for all other languages all chars must be tested.

> > Suppose there's a document tagged as zh_TW that explains how the PRC
> > government simplified Chinese characters to boost the literacy rate
> > after WW II. If a Big5 font (that doesn't cover all characters in
> > the doc) is selected instead of a GBK/GB18030 font (with the full
> > coverage), simplified Han characters (not used in Taiwan but only
> > used in PRC) in the doc have to be rendered with another font (most
> > likely a GB2312/GBK/GB18030 font).
>
> A correct version of this document would tag individual sections of
> the document with appropriate tags. This way, the zh_TW sections could
> be presented in a traditional Chinese font while the mainland portions
> are displayed with simplified Chinese glyphs.

Indeed. I wonder, however, how place names are handled. Are there place
names written with hanzi that don't exist in simplified form? If so,
what would be the preferred way to write such a place name in a
simplified Chinese text? Same question for people's names.

--
Ki ça vos våye bén,
Pablo Saratxaga
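The Spanish example in the message above amounts to a strict subset test. A minimal Python sketch, assuming the required set is exactly the 66 letters listed there (the names `SPANISH_REQUIRED`, `covers_all` and `missing` are made up for illustration):

```python
import string

# Letters named in the message: a-z plus the seven accented forms.
SPANISH_LOWER = string.ascii_lowercase + "áéíóúüñ"
SPANISH_REQUIRED = set(SPANISH_LOWER) | set(SPANISH_LOWER.upper())


def covers_all(required, coverage):
    """Strict test: every required character must be present."""
    return set(required) <= set(coverage)


def missing(required, coverage):
    """Sorted list of required characters the font does not cover."""
    return sorted(set(required) - set(coverage))
```

A font whose coverage lacks only ñ fails `covers_all` even though it has 65 of the 66 letters, which is the point being argued.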
Re: [Fonts]Automatic 'lang' determination
Keith Packard wrote:

> Actually, I could really use as many Han fonts as you have, especially
> if they are from different vendors and of different ages. All I really
> need is the fonts.cache files generated from these fonts; that holds
> the unicode coverage and any OS/2 table information. That would be a
> lot smaller, and also avoid any copyright or trade secret problems.

Sure, I will install as many Chinese fonts as possible and get the
fonts.cache for you. But before that, I will show you several lines in
my fonts.cache:

    /usr/share/fonts/zh_CN/TrueType/zysong.ttf 0 1017360509 ZYSong18030:style=regular:slant=0:weight=100:index=0:outline=True:scalable=True:charset=:lang=simplifiedchinese
    /usr/share/fonts/zh_CN/TrueType/SimSun18030.ttc 0 1021954464 SimSun\\-18030:style=regular:slant=0:weight=100:spacing=100:index=0:outline=True:scalable=True:charset=

[The encoded charset data for this entry is garbled and the message is truncated at this point in the archive.]
Re: [Fonts]Automatic 'lang' determination
On Sat, 29 Jun 2002, Yao Zhang wrote:

> It should be
>
>     if (covers_almost_all_of (GB2312))
>         font supports SIMPLIFIED Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese

After sending my previous message, I read this and I have to agree with
it. This is better than what I sent earlier. Just forgetting about
GB18030/GBK coverage and concentrating on GB2312 and Big5 coverage is
simpler as well as better.

Jungshik Shin
Re: [Fonts]Automatic 'lang' determination
Around 13 o'clock on Jun 29, Yao Zhang wrote:

>     if (covers_much_of (gb18030))
>         font supports simplified Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese
>         font does not support simplified Chinese
>
> For a GB18030 font, since it covers much of the GB18030 set, it
> supports simplified Chinese. And it also covers almost all of BIG5, so
> it supports traditional Chinese too. But now the algorithm excludes it
> from simplified Chinese support. The last line is wrong.

Yes, I think the problem is that I'm using GBK for the test instead of
GB2312 -- I got the simplified coverage information from codepage 936,
which is based on GBK. The fonts I have don't cover most of GBK, but do
cover nearly all of GB2312.

>     if (covers_almost_all_of (GB2312))
>         font supports SIMPLIFIED Chinese
>     if (covers_almost_all_of (Big5))
>         font supports traditional Chinese

Thanks, this works just fine. I'm much happier with this solution.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab
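The corrected pseudocode in the message above applies two independent tests, so a font covering both repertoires receives both tags. A rough Python rendering is shown below. The `max_missing` parameter and the function `chinese_langs` are assumptions added for illustration; the tag strings are the ones that appear in the fonts.cache lines quoted elsewhere in this thread. This is a sketch of the idea, not the code committed to XFree86.

```python
def covers_almost_all_of(repertoire, coverage, max_missing):
    """True when at most max_missing repertoire characters are absent."""
    return len(set(repertoire) - set(coverage)) <= max_missing


def chinese_langs(coverage, gb2312, big5, max_missing=0):
    """Apply the two tests independently, so one font can get both tags."""
    langs = []
    if covers_almost_all_of(gb2312, coverage, max_missing):
        langs.append("simplifiedchinese")
    if covers_almost_all_of(big5, coverage, max_missing):
        langs.append("traditionalchinese")
    return langs
```

Because neither branch removes a tag set by the other, a GB18030 font that covers both repertoires is no longer excluded from simplified Chinese.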
Re: [Fonts]Automatic 'lang' determination
Around 20 o'clock on Jun 29, Pablo Saratxaga wrote:

> A font is suited for a given language when it covers *ALL* of the
> codepoints needed for that language.

Yes, that's obviously true, but the problem is that I don't have tables
for each language indicating the required codepoints; all I have are
tables listing Unicode values in encodings traditionally used for each
language. These tables almost always include a few (1-5) glyphs which
many fonts are missing.

So, the test is to require that the number of missing glyphs for non-Han
languages is very small (8) to allow fonts which happen to be missing
only a few unimportant glyphs to be used. Discovering which glyphs in
each encoding are problematic in many fonts would allow this fudge
factor to be reduced further.

> So, the tests for CJK languages and for other languages are clearly
> different: only CJK languages can get by with testing only a
> significant fraction; for all other languages all chars must be
> tested.

Yes, the tolerance value given for the Han languages is 500 codepoints
while the value for non-Han languages is two orders of magnitude
smaller.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab
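The message above notes that finding out which glyphs are missing from many fonts would let the fudge factor shrink. One possible way to collect that information is sketched here in Python; the function name and the `min_fonts` threshold are hypothetical, and each font is represented simply as the set of characters it covers.

```python
from collections import Counter


def commonly_missing(repertoire, font_coverages, min_fonts=2):
    """Return characters from repertoire that at least min_fonts fonts lack,
    most frequently missing first."""
    counts = Counter()
    for coverage in font_coverages:
        # Count each character once per font that lacks it.
        counts.update(set(repertoire) - set(coverage))
    return [ch for ch, n in counts.most_common() if n >= min_fonts]
```

Characters that turn up in this list for most fonts are candidates for removal from the per-language table, after which a tolerance closer to zero could be used.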
Re: [Fonts]Automatic 'lang' determination
Around 14 o'clock on Jun 29, Yao Zhang wrote:

> Sure, I will install as many Chinese fonts as possible and get the
> fonts.cache for you. But before that, I will show you several lines in
> my fonts.cache:

I'm afraid the mailers corrupted the rather long lines in those files,
but given that I've discovered that GB2312 is a relatively strong test
for suitability for simplified Chinese, perhaps we can avoid sending
this data at all.

> Now for lang, ZYSong18030 is labelled as lang=simplifiedchinese while
> SimSun-18030 is labelled as
> lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us

These language tags come from the OS/2 table and are set by the font
designer. If, as our friend Jungshik Shin says, simplified forms were
not unified with traditional forms in the BMP, then it's quite
reasonable to build a font that can cover both languages.

With the new improved GB2312-based simplified test, I suspect the
correct languages would be generated automatically from this font as
well.

I've gone ahead and committed the changes necessary for automatic lang
determination to XFree86 CVS; those interested in verifying its
sensitivity and specificity are welcome to check it out and run:

    $ FC_DEBUG=256 fc-cache -f

This will display the number of missing glyphs in each language for each
font and also display errors in the lang value relative to that
specified in the TrueType file.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab
Re: [Fonts]Automatic 'lang' determination
Kaixo!

On Sat, Jun 29, 2002 at 01:20:34PM -0700, Keith Packard wrote:

> > A font is suited for a given language when it covers *ALL* of the
> > codepoints needed for that language.
>
> Yes, that's obviously true, but the problem is that I don't have
> tables for each language indicating the required codepoints; all I
> have are tables listing Unicode values in encodings traditionally used
> for each language. These tables almost always include a few (1-5)
> glyphs which many fonts are missing.

What are those glyphs? (I'm quite surprised, I would have expected the
opposite: fonts generally have more glyphs than the standard encodings
of the iso-8859 family, for example.)

> > So, the tests for CJK languages and for other languages are clearly
> > different: only CJK languages can get by with testing only a
> > significant fraction; for all other languages all chars must be
> > tested.
>
> Yes, the tolerance value given for the Han languages is 500 codepoints
> while the value for non-Han languages is two orders of magnitude
> smaller.

No, the tolerance for missing glyphs in CJK tests should be the same or
even smaller. The difference is that there is no need to test all the
glyphs for CJK coverage; testing only a set of 256 chosen glyphs would
be enough (if they are correctly chosen, testing that 256 glyphs are
present in a font is enough to assure, with 99.99% confidence, that it
covers a given CJK language).

That cannot be done for the 8-bit Latin/Cyrillic encodings because there
is too much overlap between them (in the case of iso-8859-1 and
iso-8859-15 the overlap is 97%, for example). While there is also a lot
of overlap between CJK encodings, there are large ranges of
non-overlapping chars: chars that appear only in the Japanese encoding,
or only in gb2312, or only in big5, etc. (By "only" I mean: not in any
other widely used legacy encoding, so explicitly excluding Unicode,
which of course includes them all.)

As those exclusive chars are numerous enough, it is possible to test for
the presence of some of them in a font and determine language coverage
from there.

Of course, complete checking can also be done, but I wonder if it is
actually useful (I mean, is there a font suitable for simplified Chinese
out there that doesn't encode all the characters of gb2312? It would be
like a font for English that is missing the letter r).

Big5 is a bit more problematic, as there is no such thing as a
well-defined Big5 encoding but rather, in the pure Microsoftian
tradition (big5 comes from that side, after all), a number of revisions
all with the same name, each adding some characters, so an older font
can miss some chars that a newer one has (according to a newer
definition of big5). But to handle such a case, I think it would be
better to choose a given definition of big5 (or several of them) and
stick to it, rather than allowing such a tremendously big hole as 500
possible missing chars.

--
Ki ça vos våye bén,
Pablo Saratxaga
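The probe-set idea in the message above (test a font only for characters that are exclusive to one legacy encoding) could be prototyped along these lines. This sketch uses Python's built-in `gb2312` and `big5` codecs as stand-ins for the repertoires, which is an assumption, as are all the function names. Taking the first 256 exclusive hanzi in codepoint order is a placeholder, not the careful selection the message calls for.

```python
def double_byte_repertoire(codec, lead_range, trail_range):
    """Collect every single character the codec decodes from a two-byte grid."""
    chars = set()
    for lead in lead_range:
        for trail in trail_range:
            try:
                text = bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                continue  # byte pair not assigned in this encoding
            if len(text) == 1:
                chars.add(text)
    return chars


GB2312 = double_byte_repertoire("gb2312", range(0xA1, 0xFF), range(0xA1, 0xFF))
BIG5 = double_byte_repertoire("big5", range(0xA1, 0xFA), range(0x40, 0xFF))


def is_han(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def exclusive(own, *others):
    """Characters present in own but in none of the other repertoires."""
    result = set(own)
    for other in others:
        result -= other
    return result


def probe_set(own, others, size=256):
    """Pick a deterministic sample of exclusive hanzi to test a font for."""
    candidates = sorted(ch for ch in exclusive(own, *others) if is_han(ch))
    return candidates[:size]


def passes_probe(coverage, probes):
    """A font passes when it covers every probe character."""
    return set(probes) <= set(coverage)
```

A probe built from `exclusive(GB2312, BIG5)` is passed by anything that covers all of GB2312 and failed by a font that covers only Big5. A fuller version would also subtract the Japanese and Korean repertoires.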
Re: [Fonts]Automatic 'lang' determination
Keith Packard wrote:

> Around 14 o'clock on Jun 29, Yao Zhang wrote:
>
> > Sure, I will install as many Chinese fonts as possible and get the
> > fonts.cache for you. But before that, I will show you several lines
> > in my fonts.cache:
>
> I'm afraid the mailers corrupted the rather long lines in those files,
> but given that I've discovered that GB2312 is a relatively strong test
> for suitability for simplified Chinese, perhaps we can avoid sending
> this data at all.
>
> > Now for lang, ZYSong18030 is labelled as lang=simplifiedchinese
> > while SimSun-18030 is labelled as
> > lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us
>
> These language tags come from the OS/2 table and are set by the font
> designer. If, as our friend Jungshik Shin says, simplified forms were
> not unified with traditional forms in the BMP, then it's quite
> reasonable to build a font that can cover both languages.

Although zysong and simsun are both from Beijing Zhongyi, zysong in Red
Hat 7.3 is purely a GB18030 font file; it only contains the characters
defined in the GB18030 standard. simsun, on the other hand, does provide
extra characters to support other languages such as Japanese, so its
OS/2 table says so.

Regards,
Shao

> With the new improved GB2312-based simplified test, I suspect the
> correct languages would be generated automatically from this font as
> well.
>
> I've gone ahead and committed the changes necessary for automatic lang
> determination to XFree86 CVS; those interested in verifying its
> sensitivity and specificity are welcome to check it out and run:
>
>     $ FC_DEBUG=256 fc-cache -f
>
> This will display the number of missing glyphs in each language for
> each font and also display errors in the lang value relative to that
> specified in the TrueType file.
>
> Keith Packard        XFree86 Core Team        HP Cambridge Research Lab