Re: [Fonts]Automatic 'lang' determination

2002-06-30 Thread Pablo Saratxaga


Kaixo!

On Sat, Jun 29, 2002 at 05:17:04PM -0700, Keith Packard wrote:
 
  What are those glyphs? (I'm quite surprised, I would have expected the
  opposite: fonts generally have more glyphs than the standard encodings of
  the iso-8859 family for example)
 
 My definition of language tag is coloured by the OS/2 table codePageRange 
 bits from which it was originally defined in fontconfig.  Those bits are 
 defined to map to specific Windows code pages; the Latin-1 case doesn't 
 map to ISO 8859-1, but rather to code page 1252 for which many fonts are 
 missing a few random entries.

But what characters are those?
Is it possible that they are the ones that have been added to cp1252
and that didn't exist some years ago?
I think the matching should be done against the lowest common denominator
and be strict; or it should give different weights to missing *letters*
and missing symbols (it may be more or less acceptable to get quotation
marks from another font; bUt lEttErs frOm A dIffErEnt fOnts Is vErY UglY).
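A minimal sketch of the weighting idea, in Python. The weights and the sample character sets are hypothetical, chosen only to show that one missing letter can count for more than several missing symbols; none of this is fontconfig's actual behaviour.

```python
import unicodedata

# Hypothetical weights: a missing letter costs far more than a missing symbol.
LETTER_WEIGHT = 10.0
OTHER_WEIGHT = 1.0


def weighted_miss_score(required, covered):
    """Sum a penalty over the required codepoints that the font lacks."""
    score = 0.0
    for ch in set(required) - set(covered):
        # Unicode general categories starting with "L" are letters.
        is_letter = unicodedata.category(ch).startswith("L")
        score += LETTER_WEIGHT if is_letter else OTHER_WEIGHT
    return score


# Illustrative repertoire: a-z, two accented letters, curly quotes, euro sign.
required = set("abcdefghijklmnopqrstuvwxyz") | {"é", "ñ", "\u201c", "\u201d", "\u20ac"}
font_a = required - {"\u201c", "\u201d"}  # lacks only the two curly quotes
font_b = required - {"ñ"}                 # lacks a single letter

print(weighted_miss_score(required, font_a))
print(weighted_miss_score(required, font_b))
```

With these assumed weights the font missing one letter scores worse than the font missing two quotation marks, which is the ordering argued for here.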

  No, the tolerance for missing glyphs in CJK tests should be the same or
  even smaller. The difference is that it isn't needed to test all the glyphs
  for CJK coverages; testing only a set of 256 chosen glyphs would be enough
  (if they are correctly chosen, testing that 256 glyphs are present in a
  font is enough to assure, with 99.99% of confidence, that it covers a given
  CJK language).
 
 I'm not confident enough of this approach; I fear that any set of 256 
 glyphs that must appear in a simplified Chinese font may well appear in 
 many traditional Chinese (or even Japanese) fonts.

Most do, of course, but there are a lot that don't.
I only dealt with about 10-15 TTF CJK fonts, but never had false positives
using that method.

 out there that doesn't encode all the characters of gb2312?
 
 It seems that this must be the case -- I set the '500' number so high 
 because all of the fonts which I have that advertise support for 
 simplified Chinese are missing over 200 glyphs from GB2312.  I got
 similar results for Japanese fonts, Korean Wansung fonts and traditional 
 Chinese fonts.

But which characters are missing?
Could it be that those are semi-graphic ones, or scripts used by other
languages (e.g. Cyrillic, Greek, Japanese kana in a Chinese font, etc.)?
Here too, different weights should be used: it is not a big problem if
a CJK font is missing Cyrillic, since a font designed for Russian will be a much
better choice to render Cyrillic anyway; but it may be a big problem if
some needed characters are missing.

And I'm really surprised by such a high number as 200.
Are you sure you tested against gb2312 and not against the Microsoft
codepage based on it (which surely adds several extra characters)?

 But to handle such case, I think it would be better to choose a given
 definition of big5 (or several of them) and stick to it, rather than
 allowing such a tremendously big hole as 500 possible missing chars.
 
 Missing 500 from a repertoire of nearly 2 doesn't seem to render most 
 of these fonts unusable.

It could, it depends on what glyphs are missing.


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/   PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
___
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts

Re: [Fonts]Automatic 'lang' determination

2002-06-30 Thread Yu Shao


Pablo Saratxaga wrote:

Kaixo!

On Sat, Jun 29, 2002 at 05:17:04PM -0700, Keith Packard wrote:
 

What are those glyphs? (I'm quite surprised, I would have expected the
opposite: fonts generally have more glyphs than the standard encodings of
the iso-8859 family for example)

My definition of language tag is coloured by the OS/2 table codePageRange 
bits from which it was originally defined in fontconfig.  Those bits are 
defined to map to specific Windows code pages; the Latin-1 case doesn't 
map to ISO 8859-1, but rather to code page 1252 for which many fonts are 
missing a few random entries.


But what characters are those?
Is it possible that they are the ones that have been added to cp1252
and that didn't exist some years ago?
I think the matching should be done against the lowest common denominator
and be strict; or it should give different weights to missing *letters*
and missing symbols (it may be more or less acceptable to get quotation
marks from another font; bUt lEttErs frOm A dIffErEnt fOnts Is vErY UglY).

No, the tolerance for missing glyphs in CJK tests should be the same or
even smaller. The difference is that it isn't needed to test all the glyphs
for CJK coverages; testing only a set of 256 chosen glyphs would be enough
(if they are correctly chosen, testing that 256 glyphs are present in a
font is enough to assure, with 99.99% of confidence, that it covers a given
CJK language).

I'm not confident enough of this approach; I fear that any set of 256 
glyphs that must appear in a simplified Chinese font may well appear in 
many traditional Chinese (or even Japanese) fonts.


Most do, of course, but there are a lot that don't.
I only dealt with about 10-15 TTF CJK fonts, but never had false positives
using that method.

out there that doesn't encode all the characters of gb2312?

It seems that this must be the case -- I set the '500' number so high 
because all of the fonts which I have that advertise support for 
simplified Chinese are missing over 200 glyphs from GB2312.  I got
similar results for Japanese fonts, Korean Wansung fonts and traditional 
Chinese fonts.


But which characters are missing?
Could it be that those are semi-graphic ones, or scripts used by other
languages (e.g. Cyrillic, Greek, Japanese kana in a Chinese font, etc.)?
Here too, different weights should be used: it is not a big problem if
a CJK font is missing Cyrillic, since a font designed for Russian will be a much
better choice to render Cyrillic anyway; but it may be a big problem if
some needed characters are missing.

And I'm really surprised by such a high number as 200.
Are you sure you tested against gb2312 and not against the Microsoft
codepage based on it (which surely adds several extra characters)?

Hi Keith,

Checking against fontenc, both AR PL SungtiL GB and AR PL KaitiM GB 
provide all of GB2312's 7445 characters, which comprise 6763 hanzi and 682 
symbols. So the 204 missing glyphs that fc-cache reports seem incorrect?
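One way to cross-check such figures independently is to enumerate the GB2312 repertoire from a codec table and split any missing characters into hanzi and symbols. The sketch below uses Python's built-in `gb2312` codec as a stand-in for the fontenc tables (an assumption; the two tables may differ in detail), and `classify_missing` is a hypothetical helper, not part of any font tool.

```python
def gb2312_repertoire():
    """Decode every two-byte EUC-CN sequence that the gb2312 codec accepts."""
    chars = set()
    for lead in range(0xA1, 0xF8):       # rows 1..87
        for trail in range(0xA1, 0xFF):  # cells 1..94
            try:
                chars.add(bytes([lead, trail]).decode("gb2312"))
            except UnicodeDecodeError:
                continue  # unassigned cell
    return chars


repertoire = gb2312_repertoire()
# Treat anything in the CJK Unified Ideographs block as a hanzi.
hanzi = {c for c in repertoire if "\u4e00" <= c <= "\u9fff"}
symbols = repertoire - hanzi
print(len(repertoire), len(hanzi), len(symbols))


def classify_missing(font_coverage):
    """Report how many missing GB2312 characters are hanzi and how many are symbols."""
    missing = repertoire - set(font_coverage)
    return {"hanzi": len(missing & hanzi), "symbols": len(missing & symbols)}
```

Feeding a font's Unicode coverage to `classify_missing` would show whether the reported gaps fall among the hanzi or only among the symbol rows.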

Regards,



But to handle such case, I think it would be better to choose a given
definition of big5 (or several of them) and stick to it, rather than
allowing such a tremendously big hole as 500 possible missing chars.

Missing 500 from a repertoire of nearly 2 doesn't seem to render most 
of these fonts unusable.


It could, it depends on what glyphs are missing.




-- 
Yu Shao
Red Hat Asia-Pacific
+61 7 3872 4835
Legal:   http://apac.redhat.com/disclaimer




Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard



Around 9 o'clock on Jun 29, Jungshik Shin wrote:

 IMHO, most problems with Han Unification arise not from using a _single_
 font targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in
 another but from mixing _multiple_ fonts (with _drastically different_
 design principle and other differences like baseline) to render a single
 run of text (say, 65% of characters drawn from one font, 25% from a second
 font, 7% from a third font, etc).

Yes, I agree -- this is true in Western languages as well, where the 
application selects a font covering only Latin-1 and attempts to display 
text requiring glyphs from Latin-2. A smart application will locate an 
additional font to fill in the missing glyphs, and the result looks like a 
ransom note.

The hope is that proper language tags in the document can avoid this at 
the start by making the first font contain the proper coverage for the 
entire block of text.
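As a rough illustration of picking one font for a whole run rather than patching glyph by glyph, here is a Python sketch. The font names are hypothetical, and the coverage sets are simply derived from the ISO 8859-1 and ISO 8859-2 codecs to stand in for real font coverage.

```python
def pick_font_for_run(text, fonts):
    """Return the first font whose coverage includes every character in the run.

    `fonts` is a list of (name, coverage_set) pairs in preference order.
    Returns None when no single font covers the run, so fallback is unavoidable.
    """
    needed = {ch for ch in text if not ch.isspace()}
    for name, coverage in fonts:
        if needed <= coverage:
            return name
    return None


def codec_coverage(codec):
    """Printable characters of an 8-bit codec, standing in for a font's coverage."""
    data = bytes(range(0x20, 0x7F)) + bytes(range(0xA0, 0x100))
    return set(data.decode(codec))


# Hypothetical fonts, listed in preference order.
fonts = [
    ("HypotheticalLatin1", codec_coverage("iso8859-1")),
    ("HypotheticalLatin2", codec_coverage("iso8859-2")),
]

print(pick_font_for_run("façade", fonts))
print(pick_font_for_run("Łódź", fonts))
```

The second run needs Latin-2 letters, so the whole run goes to the second font instead of mixing glyphs from both.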

This goal is reflected in the design I outlined -- fonts are deemed 
suitable for a particular language when they cover a significant 
fraction of the codepoints commonly associated with that language.

 Suppose there's a document tagged as zh_TW that explains how PRC government
 simplified Chinese characters to boost the literacy rate after WW II. If a
 Big5 font (that doesn't cover all characters in the doc) is selected
 instead of a GBK/GB18030 font (with the full coverage), simplified Han
 characters(not used in Taiwan but only used in PRC) in the doc have to be
 rendered with another font (most likely GB2312/GBK/GB18030 font).

A correct version of this document would tag individual sections of the
document with appropriate tags.  This way, the zh_TW sections could be
presented in a traditional Chinese font while the mainland portions are
displayed with simplified Chinese glyphs.

I don't know how prevalent language tagging is in office document formats, 
but it's certainly available in HTML.  It's the HTML case that started my 
journey into language tags.

  I'm not sure what you meant by 'glyph forms are more likely
 simplified'. You might have misunderstood some aspects of Han Unification
 in Unicode/10646.  In Unicode, simplified forms of Chinese characters are
 NOT unified with corresponding traditional forms of Chinese characters.

You're right -- I didn't believe this to be the case.  I had heard that the
unified portion within the BMP does co-mingle simplified and traditional
forms, but that the non-BMP Han extensions provide separate codepoints for
each.

If even BMP codepoints are separate, then it should be possible to create 
a large set of codepoints which could mark fonts as suitable for the 
display of simplified Chinese and which is distinct from the set of 
codepoints suitable for the display of traditional Chinese.   That would 
be nicer than my current kludge of marking any font suitable for 
traditional Chinese as unsuitable for simplified Chinese.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Jungshik Shin





On Sat, 29 Jun 2002, Jungshik Shin wrote:
 On Fri, 28 Jun 2002, Keith Packard wrote:

  I'm confused by this; my exposure to Chinese fonts says that simplified
  Chinese and traditional Chinese have significant overlap in Unicode
  codepoints, but that the glyphs are quite a bit different in appearance.

   I doubt this is the case. As far as I can tell

  I found this needs some clarification.  If glyphs of 'A', 'B'
and 'C' from a Times Roman Latin-1 font are compared with the corresponding
glyphs from a New Century Schoolbook Latin-2 font, they certainly look
different. However, that does not mean that you cannot use the Times Roman
Latin-1 font to render a run of text in one of the languages Latin-2 is meant
for, as long as the Times Roman Latin-1 font has _all_ the glyphs necessary in
that particular run of text.

  I believe the same thing can happen between two fonts for
zh-TW and zh-CN. If glyphs from font A for zh-TW are compared with glyphs
from font B (with different design principles) for zh-CN, they for sure
look different. However, they're different not because font A is for zh-TW
and font B is for zh-CN but because they're designed to appear different.

  Chinese and traditional Chinese have significant overlap in Unicode
  codepoints, but that the glyphs are quite a bit different in appearance.

  To make this kind of comparison meaningful, you have to compare
two fonts, one for zh-TW and the other for zh-CN, made by a _single_
foundry with the _identical_ design principles and look and feel
(something like Adobe Times Roman Latin-1 font and Adobe Times Roman
Latin-2 font).

  In practice, it's hard to find two fonts that satisfy the criteria I
outlined here.  However, ISO 10646 code charts for Han characters should
do almost as good a job.  That's why I suggested comparing glyphs for
PRC and Taiwan in the ISO 10646 Han character chart.

   Jungshik Shin


Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yao Zhang


I wrote earlier:
 
 Actually, it is better changed to
   if (covers_almost_all_of (GB2312))
   font supports traditional Chinese
   if (covers_almost_all_of (Big5))
   font supports traditional Chinese
 
It should be

if (covers_almost_all_of (GB2312))
    font supports SIMPLIFIED Chinese
if (covers_almost_all_of (Big5))
    font supports traditional Chinese

Sorry about the typo.

Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Pablo Saratxaga


Kaixo!

On Sat, Jun 29, 2002 at 09:34:43AM -0700, Keith Packard wrote:
 
 This goal is reflected in the design I outlined -- fonts are deemed 
 suitable for a particular language when they cover a significant 
 fraction of the codepoints commonly associated with that language.

That is unacceptable.
A font is suited for a given language when it covers *ALL* of the codepoints
needed for that language.

The only exception to checking *all* of the needed codepoints is that
of CJK languages, because:
- there is a very small set of such languages
- the fonts are designed with coverage of one of them in mind
- the mandatory glyphs needed for a given CJK language that don't
  overlap with any other CJK language make quite a big set, allowing
  one to test just a small, carefully chosen set of glyphs and assume
  that all other glyphs needed for a given CJK language are present too.

Maybe scripts used for one and only one language can also be handled
without the need to check all the needed codepoints (but on the other hand
they always form a small number of codepoints, so checking them all is
not a problem).

But for the big majority of languages, which are not the only ones written 
with a given script, just checking coverage of a significant fraction
is not enough.

Take Spanish, for example: it needs the a-z letters plus áéíóúüñ (that is, aacute,
eacute, iacute, oacute, uacute, udiaeresis and ntilde).
If only one of these is missing then you cannot render a Spanish text
correctly; even if the font covers 65 of the 66 chars (33 lowercase,
33 uppercase), it is still not suitable to properly render
Spanish text (it may go unnoticed if the text just happens not to
use the missing letter, but relying on chance is not very serious).

So, the tests for CJK languages and for other languages are clearly different:
only CJK languages can get by with testing only a significant fraction;
for all other languages all chars must be tested.
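The strict test for Spanish can be sketched in a few lines of Python. The required set is built from the letters listed above; the function names and the example font are hypothetical.

```python
# The 33 lowercase letters named in the message: a-z plus seven accented forms.
SPANISH_LOWER = "abcdefghijklmnopqrstuvwxyz" + "áéíóúüñ"
SPANISH_REQUIRED = set(SPANISH_LOWER) | set(SPANISH_LOWER.upper())


def missing_for_spanish(coverage):
    """List the required Spanish letters that the font does not cover."""
    return sorted(SPANISH_REQUIRED - set(coverage))


def supports_spanish(coverage):
    """Strict test: every required letter must be present."""
    return not missing_for_spanish(coverage)


# A hypothetical font covering 65 of the 66 letters still fails the strict test.
almost = SPANISH_REQUIRED - {"ñ"}
print(len(SPANISH_REQUIRED), missing_for_spanish(almost), supports_spanish(almost))
```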
 
  Suppose there's a document tagged as zh_TW that explains how PRC government
  simplified Chinese characters to boost the literacy rate after WW II. If a
  Big5 font (that doesn't cover all characters in the doc) is selected
  instead of a GBK/GB18030 font (with the full coverage), simplified Han
  characters(not used in Taiwan but only used in PRC) in the doc have to be
  rendered with another font (most likely GB2312/GBK/GB18030 font).
 
 A correct version of this document would tag individual sections of the
 document with appropriate tags.  This way, the zh_TW sections could be
 presented in a traditional Chinese font while the mainland portions are
 displayed with simplified Chinese glyphs.

Indeed.

I wonder, however, how place names are handled. Are there place names
using hanzi that don't exist in simplified form?
If so, what would be the preferred way to write such a place name
in a simplified Chinese text?
Same question for personal names.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/   PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]

Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yao Zhang


Keith Packard wrote:

 Actually, I could really use as many Han fonts as you have, especially if
 they are from different vendors and of different ages.  All I really need
 is the fonts.cache files generated from these fonts; that holds the unicode
 coverage and any OS/2 table information.  That would be a lot smaller, and
 also avoid any copyright or trade secret problems.

Sure, I will install as many Chinese fonts as possible and get the
fonts.cache for you.  But before that, I will show you several lines in my
fonts.cache:

/usr/share/fonts/zh_CN/TrueType/zysong.ttf 0 1017360509 
ZYSong18030:style=regular:slant=0:weight=100:index=0:outline=True:scalable=True:charset=:lang=simplifiedchinese
/usr/share/fonts/zh_CN/TrueType/SimSun18030.ttc 0 1021954464 
SimSun\\-18030:style=regular:slant=0:weight=100:spacing=100:index=0:outline=True:scalable=True:charset=
  [encoded charset data omitted: the long lines were corrupted by the mailers, and the archived message breaks off here]

Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Jungshik Shin


On Sat, 29 Jun 2002, Yao Zhang wrote:

 It should be

   if (covers_almost_all_of (GB2312))
   font supports SIMPLIFIED Chinese
   if (covers_almost_all_of (Big5))
   font supports traditional Chinese

  After sending my prev. message, I read this and I have to
agree with this. This is better than what I sent earlier.  Just forgetting
about GB18030/GBK coverage and concentrating on GB2312 and Big5 coverage
is simpler as well as better.

  Jungshik Shin


Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard



Around 13 o'clock on Jun 29, Yao Zhang wrote:

   if (covers_much_of (gb18030))
   font supports simplified Chinese
   if (covers_almost_all_of (Big5))
   font supports traditional Chinese
   font does not support simplified Chinese
 
 For a GB18030 font, since it covers much of the GB18030 set, it supports
 simplified Chinese.  And it also covers almost all of BIG5, so it
 supports traditional Chinese too.  But now the algorithm excludes it
 from simplified Chinese support.  The last line is wrong.

Yes, I think the problem is that I'm using GBK for the test instead of
GB2312 -- I got the simplified coverage information from codepage 936 which
is based on GBK.

The fonts I have don't cover most of GBK, but do cover nearly all of 
GB2312.  

   if (covers_almost_all_of (GB2312))
   font supports SIMPLIFIED Chinese
   if (covers_almost_all_of (Big5))
   font supports traditional Chinese

Thanks, this works just fine.  I'm much happier with this solution.
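For readers who want to experiment, the two-line test can be approximated in Python using the interpreter's own `gb2312` and `big5` codecs as the reference repertoires. This is only a sketch: fontconfig's real tables and thresholds may differ, and the `max_missing` default of 500 is borrowed from the Han tolerance mentioned elsewhere in this thread, not from the committed code.

```python
def double_byte_repertoire(codec, lead_range):
    """Collect the characters a CJK codec assigns to two-byte sequences."""
    chars = set()
    for lead in lead_range:
        for trail in range(0x40, 0xFF):
            try:
                text = bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this codec
            if len(text) == 1:
                chars.add(text)
    return chars


GB2312 = double_byte_repertoire("gb2312", range(0xA1, 0xF8))
BIG5 = double_byte_repertoire("big5", range(0xA1, 0xFA))


def covers_almost_all_of(coverage, charset, max_missing=500):
    """True when at most `max_missing` characters of `charset` are absent."""
    return len(charset - coverage) <= max_missing


def han_langs(coverage):
    """Apply both tests independently, so one font can earn both tags."""
    langs = []
    if covers_almost_all_of(coverage, GB2312):
        langs.append("simplified Chinese")
    if covers_almost_all_of(coverage, BIG5):
        langs.append("traditional Chinese")
    return langs


print(han_langs(GB2312))
print(han_langs(GB2312 | BIG5))
```

Because neither test excludes the other, a font covering both repertoires (as a GB18030 font would) is tagged for both languages.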

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard



Around 20 o'clock on Jun 29, Pablo Saratxaga wrote:

 A font is suited for a given language when it covers *ALL* of the codepoints
 needed for that language.

Yes, that's obviously true, but the problem is that I don't have tables for
each language indicating the required codepoints; all I have are tables
listing Unicode values in encodings traditionally used for each language.
These tables almost always include a few (1-5) glyphs which many fonts are
missing.

So, the test is to require that the number of missing glyphs for non-Han 
languages is very small (8) to allow fonts which happen to be missing 
only a few unimportant glyphs to be used.  Discovering which glyphs in 
each encoding are problematic in many fonts would allow this fudge factor 
to be reduced further.
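Finding the problematic glyphs amounts to tallying, per codepoint, how many fonts lack it. A Python sketch follows; the three font coverages are invented for illustration, and cp1252 is taken from Python's codec rather than from fontconfig's tables.

```python
from collections import Counter


def cp1252_repertoire():
    """Characters assigned to bytes 0x20..0xFF in Python's cp1252 codec."""
    chars = set()
    for value in range(0x20, 0x100):
        try:
            chars.add(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # a handful of cp1252 bytes are unassigned
    return chars


def commonly_missing(charset, font_coverages, min_fraction=0.5):
    """Return the characters of `charset` absent from at least `min_fraction` of fonts."""
    tally = Counter()
    for coverage in font_coverages.values():
        tally.update(charset - coverage)
    cutoff = min_fraction * len(font_coverages)
    return sorted(ch for ch, count in tally.items() if count >= cutoff)


charset = cp1252_repertoire()
# Hypothetical fonts: two lack the euro sign, one of them also lacks Z-caron.
fonts = {
    "FontA": charset - {"€", "Ž", "ž"},
    "FontB": charset - {"€"},
    "FontC": set(charset),
}
print(commonly_missing(charset, fonts))
```

Characters that turn up in this list across a real font collection would be candidates for removal from the per-language table, letting the tolerance shrink.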

 So, the tests for CJK languages and for other languages are clearly different,
 only CJK languages can go with testing only a significant fraction,
 for all other languages all chars must be tested.

Yes, the tolerance value given for the Han languages is 500 codepoints 
while the value for non-Han languages is two orders of magnitude smaller.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard



Around 14 o'clock on Jun 29, Yao Zhang wrote:

 Sure, I will install as many Chinese fonts as possible and get the
 fonts.cache for you.  But before that, I will show you several lines in my
 fonts.cache:

I'm afraid the mailers corrupted the rather long lines in those files, but 
given that I've discovered that GB2312 is a relatively strong test for 
suitability for simplified Chinese, perhaps we can avoid sending this data 
at all.

 Now for lang, ZYSong18030 is labelled as
 lang=simplifiedchinese
 while SimSun-18030 is labelled as
 
lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us

These language tags come from the OS/2 table and are set by the font 
designer.  If, as our friend Jungshik Shin says, simplified forms were
not unified with traditional forms in the BMP, then it's quite reasonable 
to build a font that can cover both languages.

With the new improved GB2312-based simplified test, I suspect the correct 
languages would be generated automatically from this font as well.

I've gone ahead and committed the changes necessary for automatic lang 
determination to XFree86 CVS; those interested in verifying its 
sensitivity and specificity are welcome to check it out and run:

$ FC_DEBUG=256 fc-cache -f

This will display the number of missing glyphs in each language for each 
font and also display errors in the lang value relative to that specified 
in the TrueType file.
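The comparison between a declared and a computed lang value is a set difference. A small Python sketch of that idea: the helper names are hypothetical, the declared string is the SimSun-18030 value quoted above, and the computed string is invented for illustration. It does not reproduce the actual fc-cache debug output.

```python
def parse_lang(value):
    """Split a comma-separated lang value into a set of tags."""
    return {tag for tag in value.split(",") if tag}


def lang_discrepancies(declared, computed):
    """Report tags present on only one side of the comparison."""
    declared_set, computed_set = parse_lang(declared), parse_lang(computed)
    return {
        "declared_only": sorted(declared_set - computed_set),
        "computed_only": sorted(computed_set - declared_set),
    }


declared = ("latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,"
            "koreanjohab,arabic864,arabicasmo708,us")
computed = "latin1,simplifiedchinese,traditionalchinese"  # hypothetical result
print(lang_discrepancies(declared, computed))
```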

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Pablo Saratxaga


Kaixo!

On Sat, Jun 29, 2002 at 01:20:34PM -0700, Keith Packard wrote:
 
  A font is suited for a given language when it covers *ALL* of the codepoints
  needed for that language.
 
 Yes, that's obviously true, but the problem is that I don't have tables for
 each language indicating the required codepoints, all I have are tables
 listing Unicode values in encodings traditionally used for each language.
 These tables almost always include a few (1-5) glyphs which many fonts are
 missing.

What are those glyphs?
(I'm quite surprised, I would have expected the opposite: fonts generally
have more glyphs than the standard encodings of the iso-8859 family
for example)

 So, the tests for CJK languages and for other languages are clearly different,
 only CJK languages can go with testing only a significant fraction,
 for all other languages all chars must be tested.
 
 Yes, the tolerance value given for the Han languages is 500 codepoints 
 while the value for non-Han languages is two orders of magnitude smaller.

No, the tolerance for missing glyphs in CJK tests should be the
same or even smaller.
The difference is that it isn't necessary to test all the glyphs for CJK
coverage; testing only a set of 256 chosen glyphs would be enough
(if they are correctly chosen, testing that 256 glyphs are present in
a font is enough to assure, with 99.99% confidence, that it covers
a given CJK language).

That cannot be done for the 8-bit Latin/Cyrillic encodings because
there is too much overlap between them (in the case of
iso-8859-1/iso-8859-15 the overlap is 97%, for example).
While there is also a lot of overlap between CJK encodings, there
are large ranges of non-overlapping chars: chars that appear only in
the Japanese encoding, or only in gb2312, or only in big5, etc. (By
"only" I mean not in any other widely used legacy encoding, so explicitly
excluding Unicode, which of course includes them all.) As those exclusive
chars are numerous enough, it is possible to test for the presence of
some of them in a font and determine a language coverage from there.
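Both halves of this argument can be tried out with Python's built-in codecs: the overlap between the two 8-bit Latin encodings, and a probe set of characters exclusive to one CJK encoding. This is a sketch under assumptions: the codecs stand in for the "widely used legacy encodings", only two-byte sequences are enumerated, and the evenly spaced choice of 256 probes is a placeholder for a carefully chosen set.

```python
def single_byte_repertoire(codec):
    """Characters reachable from one byte in an 8-bit codec."""
    chars = set()
    for value in range(0x100):
        try:
            chars.add(bytes([value]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this codec
    return chars


def double_byte_repertoire(codec):
    """Characters reachable from a two-byte sequence in a CJK codec."""
    chars = set()
    for lead in range(0x81, 0xFF):
        for trail in range(0x40, 0xFF):
            try:
                text = bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                continue
            if len(text) == 1:
                chars.add(text)
    return chars


latin1 = single_byte_repertoire("iso8859-1")
latin9 = single_byte_repertoire("iso8859-15")
shared = latin1 & latin9
print(f"{len(shared)} of {len(latin1)} shared ({len(shared) / len(latin1):.1%})")

cjk = {name: double_byte_repertoire(name) for name in ("gb2312", "big5", "euc_jp")}


def exclusive(name):
    """Characters of one encoding that appear in none of the others."""
    others = set().union(*(chars for other, chars in cjk.items() if other != name))
    return cjk[name] - others


def probe_set(name, size=256):
    """Pick up to `size` exclusive characters, spread evenly over the sorted set."""
    ordered = sorted(exclusive(name))
    step = max(1, len(ordered) // size)
    return ordered[::step][:size]


def passes_probe(coverage, probe):
    """True when the font covers every probe character."""
    return all(ch in coverage for ch in probe)


gb_probe = probe_set("gb2312")
print(len(gb_probe), passes_probe(cjk["gb2312"], gb_probe), passes_probe(cjk["big5"], gb_probe))
```

Since the two Latin repertoires differ in only a few positions, no small probe set separates them, whereas the GB2312 probe rejects a pure Big5 repertoire outright.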

Of course, complete checking can also be done, but I wonder if it is
actually useful (I mean, is there a font suitable for simplified Chinese
out there that doesn't encode all the characters of gb2312? It would be
like a font for English that is missing the letter r).
Big5 is a bit more problematic, as there is no such thing as a
well-defined Big5 encoding, but rather, in the pure Microsoftian tradition
(big5 comes after all from that side), a number of revisions all named
the same, each adding some characters, so an older font can miss some
chars that a newer one has (according to a newer definition of big5).

But to handle such case, I think it would be better to choose a given
definition of big5 (or several of them) and stick to it, rather than
allowing such a tremendously big hole as 500 possible missing chars.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/   PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]

Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yu Shao


Keith Packard wrote:

Around 14 o'clock on Jun 29, Yao Zhang wrote:

Sure, I will install as many Chinese fonts as possible and get the
fonts.cache for you.  But before that, I will show you several lines in my
fonts.cache:


I'm afraid the mailers corrupted the rather long lines in those files, but 
given that I've discovered that GB2312 is a relatively strong test for 
suitability for simplified Chinese, perhaps we can avoid sending this data 
at all.

Now for lang, ZYSong18030 is labelled as
lang=simplifiedchinese
while SimSun-18030 is labelled as

lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us


These language tags come from the OS/2 table and are set by the font 
designer.  If, as our friend Jungshik Shin says, simplified forms were
not unified with traditional forms in the BMP, then it's quite reasonable 
to build a font that can cover both languages.

Although zysong and simsun are both from Beijing Zhongyi, the 
zysong in Red Hat 7.3 is purely a GB18030 font file; it only contains 
the characters defined in the GB18030 standard. Simsun, by contrast, provides 
extra characters to support other languages such as Japanese, so its OS/2 
table says so.

Regards,

Shao


With the new improved GB2312-based simplified test, I suspect the correct 
languages would be generated automatically from this font as well.

I've gone ahead and committed the changes necessary for automatic lang 
determination to XFree86 CVS; those interested in verifying its 
sensitivity and specificity are welcome to check it out and run:

   $ FC_DEBUG=256 fc-cache -f

This will display the number of missing glyphs in each language for each 
font and also display errors in the lang value relative to that specified 
in the TrueType file.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab


