Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yu Shao

Keith Packard wrote:

>Around 14 o'clock on Jun 29, Yao Zhang wrote:
>
>>Sure, I will install as many Chinese fonts as possible and get the
>>fonts.cache for you.  But before that, I will show you several lines in my
>>fonts.cache:
>>
>
>I'm afraid the mailers corrupted the rather long lines in those files, but 
>given that I've discovered that GB2312 is a relatively strong test for 
>suitability for simplified chinese, perhaps we can avoid sending this data 
>at all.
>
>>Now for lang, ZYSong18030 is labelled as
>>lang=simplifiedchinese
>>while SimSun-18030 is labelled as
>>
>lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us
>>
>
>These language tags come from the OS/2 table and are set by the font 
>designer.  If, as our friend Jungshik Shin says, simplified forms were
>not unified with traditional forms in the BMP, then it's quite reasonable 
>to build a font that can cover both languages.
>
Although zysong and simsun are both from Beijing Zhongyi, the zysong in 
Red Hat 7.3 is purely a GB18030 font file: it contains only the characters 
defined in the GB18030 standard. simsun, on the other hand, provides extra 
characters to support other languages such as Japanese, which is why its 
OS/2 table says so.

Regards,

Shao

>
>With the new improved GB2312-based simplified test, I suspect the correct 
>languages would be generated automatically from this font as well.
>
>I've gone ahead and committed the changes necessary for automatic lang 
>determination to XFree86 CVS; those interested in verifying its 
>sensitivity and specificity are welcome to check it out and run:
>
>   $ FC_DEBUG=256 fc-cache -f
>
>This will display the number of missing glyphs in each language for each 
>font and also display errors in the lang value relative to that specified 
>in the TrueType file.
>
>Keith Packard        XFree86 Core Team        HP Cambridge Research Lab
>
>
>___
>Fonts mailing list
>[EMAIL PROTECTED]
>http://XFree86.Org/mailman/listinfo/fonts
>



___
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts



[Fonts]Re: Han unification(SC and TC)(was..Re: Automatic 'lang' determination )

2002-06-29 Thread John H. Jenkins


On Saturday, June 29, 2002, at 12:31 PM, Jungshik Shin wrote:

>   I'm afraid what you have heard of BMP section is misleading if
> I understood you correctly. Whether in BMP or not, simplified forms of
> Chinese characters are NOT UNIFIED with traditional forms of Chinese
> characters. (let me copy my message to John H. Jenkins @Apple who knows a
> lot more about Han Unification than I do.)

This is correct.  The interconversion between SC and TC is in general 
m-to-n, and so unification would not have been possible.  Where a 
character is simply "written differently" in the PRC from Taiwan and 
elsewhere, they are unified (e.g., U+988A), and where an already extant 
character is used as a simplification for another, the older character and 
the simplified character are unified (e.g., U+53F0, which is both a TC 
character in its own right and the simplification for other characters, 
such as U+98B1).  This is done, however, only because the SC form is seen 
as separate from its TC counterpart(s).

> AFAIK, most complaints about
> Han unification do NOT come from zh-CN vs zh-TW BUT from zh-CN/zh-TW
> vs ja. For Han characters common in both zh-CN and zh-TW, there's no
> significant difference in appearance between zh-CN and zh-TW.

Actually, there are some exceptions to this.  U+988A and characters 
containing it make up the bulk of this.  In general, however, you're quite 
correct.

> Although
> many Japanese would not agree with me, I don't think there's any
> significant difference across CJKV.

Also correct.  It's on the order of "color" vs. "colour".  In the bulk of 
the cases which have been unified, all the unified forms will be 
recognized by native readers of all the languages involved, even if they 
may look a little "funny."

> (again, ISO 10646 Han chart is a
> good reference along with ROC MOE's Han character variant dictionary at
> http://140.111.1.40) To me, Han Unification should have gone further (not
> less) in a sense and it's worrisome to me that non-BMP includes too many
> glyph variants (a whole bunch of them coming from Korean Buddhist text :
> see http://www.sutra.re.kr)  that should have been unified in my eyes.
>

*sigh*  This is also true.  We should have pushed harder on the IRG during 
the Extension B work to keep this very thing from happening.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/




Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard


Around 1 o'clock on Jun 30, Pablo Saratxaga wrote:

> What are those glyphs? (I'm quite surprised, I would have expected the
> opposite: fonts generally have more glyphs than the standard encodings of
> the iso-8859 family for example)

My definition of language tag is coloured by the OS/2 table codePageRange 
bits from which it was originally defined in fontconfig.  Those bits are 
defined to map to specific Windows code pages; the Latin-1 case doesn't 
map to ISO 8859-1, but rather to code page 1252 for which many fonts are 
missing a few random entries.

Similarly for the other tags, the existing fonts that I have don't 
generally seem to cover the complete windows code page from which the 
codePageRange bit was derived.
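
To make the ISO 8859-1 versus code page 1252 point concrete, here is a
rough sketch in Python. It is not fontconfig's code; the helper names and
the "font" (a set of characters covering exactly ISO 8859-1) are made up
for illustration, and the code page tables are simply whatever Python's
own codecs define.

```python
def single_byte_repertoire(codec):
    """Characters a single-byte codec defines at 0x20-0xFF (DEL excluded)."""
    chars = set()
    for byte in range(0x20, 0x100):
        if byte == 0x7F:  # skip DEL
            continue
        try:
            chars.add(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            pass  # this position is undefined in the code page
    return chars


def missing_glyphs(font_chars, codec):
    """Characters the code page defines that the font does not cover."""
    return single_byte_repertoire(codec) - font_chars


# Hypothetical font covering exactly the printable part of ISO 8859-1.
latin1_font = {chr(c) for c in range(0x20, 0x7F)} | {chr(c) for c in range(0xA0, 0x100)}

missing = missing_glyphs(latin1_font, "cp1252")
print(len(missing), "code page 1252 characters are absent, e.g.", sorted(missing)[:5])
```

Such a font is complete for ISO 8859-1 yet still comes up short against
code page 1252, because the latter assigns extra characters (the euro
sign, typographic quotes and so on) to the 0x80-0x9F range.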

> No, the tolerance for missing glyphs in CJK tests should be the same or
> even smaller. The difference is that it isn't necessary to test all the glyphs
> for CJK coverage; testing only a set of 256 chosen glyphs would be enough
> (if they are correctly chosen, testing that 256 glyphs are present in a
> font is enough to assure, with 99.99% confidence, that it covers a given
> CJK language).

I'm not confident enough of this approach; I fear that any set of 256 
glyphs that must appear in a simplified Chinese font may well appear in 
many traditional Chinese (or even Japanese) fonts.  

Certainly we could experimentally determine a reasonable subset, and it's 
completely trivial to change the matching table used in the code.

> Of course, complete checking can also be done, but I wonder if it is
> actually useful (I mean, is there a font suitable for simplified chinese
> out there that doesn't encode all the characters of gb2312?

It seems that this must be the case -- I set the '500' number so high 
because all of the fonts which I have that advertise support for 
simplified Chinese are missing over 200 glyphs from GB2312.  I got
similar results for Japanese fonts, Korean Wansung fonts and traditional 
Chinese fonts.

I would need a significantly larger set of fonts than I currently have 
access to if I wanted to generate smaller test char sets.  Now that the 
tests stand in isolation, perhaps those skilled with particular languages 
can develop more specific tests.

> But to handle such a case, I think it would be better to choose a given
> definition of "big5" (or several of them) and stick to it, rather than
> allowing such a tremendously big hole as 500 possible missing chars.

Missing 500 from a repertoire of nearly 2 doesn't seem to render most 
of these fonts unusable.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab






Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Pablo Saratxaga

Kaixo!

On Sat, Jun 29, 2002 at 01:32:36PM -0700, Keith Packard wrote:

> These language tags come from the OS/2 table and are set by the font 
> designer.  If, as our friend Jungshik Shin says, simplified forms were
> not unified with traditional forms in the BMP, then it's quite reasonable 
> to build a font that can cover both languages.

Yes, in fact only "zh" would be enough as a language tag.
There are real differences in typographic traditions between Chinese
and Japanese, so even when viewing the same character you can in some cases
tell whether it has been extracted from a Chinese or a Japanese publication.
The traditional/simplified differences aside, I don't think there are
differences in typographic tradition between zh_CN and zh_TW;
it is possible to design a typeface suitable for both.
It is not possible to design a typeface suitable for both ja and zh.

The difference between zh_CN and zh_TW as language tags is however useful,
because a large number of fonts cover only one of the two sets.


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/   PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Pablo Saratxaga

Kaixo!

On Sat, Jun 29, 2002 at 01:20:34PM -0700, Keith Packard wrote:
 
> > A font is suited for a given language when it covers *ALL* of the codepoints
> > needed for that language.
> 
> Yes, that's obviously true, but the problem is that I don't have tables for
> each language indicating the required codepoints, all I have are tables
> listing Unicode values in encodings traditionally used for each language.
> These tables almost always include a few (1-5) glyphs which many fonts are
> missing.

What are those glyphs?
(I'm quite surprised, I would have expected the opposite: fonts generally
have more glyphs than the standard encodings of the iso-8859 family
for example)

>> So, the tests for CJK languages and for other languages are clearly different,
>> only CJK languages can go with testing only a "significant fraction",
>> for all other languages all chars must be tested.
> 
> Yes, the tolerance value given for the Han languages is 500 codepoints 
> while the value for non-Han languages is two orders of magnitude smaller.

No, the tolerance for missing glyphs in CJK tests should be the
same or even smaller.
The difference is that it isn't necessary to test all the glyphs for CJK
coverage; testing only a set of 256 chosen glyphs would be enough
(if they are correctly chosen, testing that 256 glyphs are present in
a font is enough to assure, with 99.99% confidence, that it covers
a given CJK language).

That cannot be done for the 8-bit Latin/Cyrillic encodings because
there is too much overlap between them (in the case of
iso-8859-1/iso-8859-15 the overlap is 97%, for example).
While there is also a lot of overlap between CJK encodings, there
are large ranges of non-overlapping chars, chars that appear only in
the Japanese encoding, or only in gb2312, or only in big5 etc. (I mean
by "only": "not in any other widely used legacy encoding", so explicitly
excluding Unicode, which of course includes them all). As those "exclusive"
chars are numerous enough, it is possible to test for the presence of
some of them in a font and determine a language coverage from there.
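
A minimal sketch of that "exclusive chars" idea in Python, under several
assumptions: the legacy repertoires are taken from Python's own gb2312,
big5, euc_jp and euc_kr codecs (which may differ in detail from the
tables fontconfig uses), the function names are invented for this
example, and the 256 probe characters are picked by even spacing rather
than by any linguistic judgement about which ones are "correctly chosen".

```python
def repertoire(codec):
    """Collect every character a legacy codec decodes from a two-byte sequence."""
    chars = set()
    for lead in range(0x81, 0xFF):
        for trail in range(0x40, 0xFF):
            try:
                text = bytes((lead, trail)).decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this encoding
            if len(text) == 1:
                chars.add(text)
    return chars


def exclusive_sample(target, others, size=256):
    """Pick `size` characters found in `target` but in none of `others`."""
    exclusive = repertoire(target)
    for codec in others:
        exclusive -= repertoire(codec)
    ordered = sorted(exclusive)
    if len(ordered) < size:
        raise ValueError("not enough exclusive characters to sample from")
    step = len(ordered) // size
    return ordered[::step][:size]


def probably_covers(font_chars, sample):
    """True when the font has every probe character."""
    return all(ch in font_chars for ch in sample)


gb2312_probe = exclusive_sample("gb2312", ["big5", "euc_jp", "euc_kr"])
print(len(gb2312_probe), "probe characters for simplified Chinese")
```

Note that this only shows the probe characters are exclusive to one
legacy encoding; it says nothing about how real fonts behave, since a
font may carry glyphs from outside the encoding it was designed for.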

Of course, complete checking can also be done, but I wonder if it is
actually useful (I mean, is there a font suitable for simplified Chinese
out there that doesn't encode all the characters of gb2312? It would be
like a font for English that is missing the letter "r").
"Big5" is a bit more problematic, as there is no such thing as a well
defined "Big5" encoding, but rather, in the pure Microsoftian tradition
(big5 comes after all from that side), a number of revisions all named
the same, each adding some characters, so an older font can miss some
chars that a newer one has (according to a newer definition of "big5").

But to handle such a case, I think it would be better to choose a given
definition of "big5" (or several of them) and stick to it, rather than
allowing such a tremendously big hole as 500 possible missing chars.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/   PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard


Around 14 o'clock on Jun 29, Yao Zhang wrote:

> Sure, I will install as many Chinese fonts as possible and get the
> fonts.cache for you.  But before that, I will show you several lines in my
> fonts.cache:

I'm afraid the mailers corrupted the rather long lines in those files, but 
given that I've discovered that GB2312 is a relatively strong test for 
suitability for simplified chinese, perhaps we can avoid sending this data 
at all.

> Now for lang, ZYSong18030 is labelled as
> lang=simplifiedchinese
> while SimSun-18030 is labelled as
> 
>lang=latin1,arabic,simplifiedchinese,koreanwansung,traditionalchinese,koreanjohab,arabic864,arabicasmo708,us

These language tags come from the OS/2 table and are set by the font 
designer.  If, as our friend Jungshik Shin says, simplified forms were
not unified with traditional forms in the BMP, then it's quite reasonable 
to build a font that can cover both languages.

With the new improved GB2312-based simplified test, I suspect the correct 
languages would be generated automatically from this font as well.

I've gone ahead and committed the changes necessary for automatic lang 
determination to XFree86 CVS; those interested in verifying its 
sensitivity and specificity are welcome to check it out and run:

$ FC_DEBUG=256 fc-cache -f

This will display the number of missing glyphs in each language for each 
font and also display errors in the lang value relative to that specified 
in the TrueType file.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab





Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard


Around 20 o'clock on Jun 29, Pablo Saratxaga wrote:

> A font is suited for a given language when it covers *ALL* of the codepoints
> needed for that language.

Yes, that's obviously true, but the problem is that I don't have tables for
each language indicating the required codepoints, all I have are tables
listing Unicode values in encodings traditionally used for each language.
These tables almost always include a few (1-5) glyphs which many fonts are
missing.

So, the test is to require that the number of missing glyphs for non-Han 
languages is very small (<8) to allow fonts which happen to be missing 
only a few unimportant glyphs to be used.  Discovering which glyphs in 
each encoding are problematic in many fonts would allow this fudge factor 
to be reduced further.

> So, the tests for CJK languages and for other languages are clearly different,
> only CJK languages can go with testing only a "significant fraction",
> for all other languages all chars must be tested.

Yes, the tolerance value given for the Han languages is 500 codepoints 
while the value for non-Han languages is two orders of magnitude smaller.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab





Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard


Around 13 o'clock on Jun 29, Yao Zhang wrote:

>   if (covers_much_of (gb18030))
>       font supports simplified Chinese
>   if (covers_almost_all_of (Big5))
>       font supports traditional Chinese
>       font does not support simplified Chinese
> 
> For a GB18030 font, since it covers much of the GB18030 set, it supports
> simplified Chinese.  And it also covers almost all of Big5, so it
> supports traditional Chinese too.  But now the algorithm excludes it
> from simplified Chinese support.  The last line is wrong.

Yes, I think the problem is that I'm using GBK for the test instead of
GB2312 -- I got the simplified coverage information from codepage 936 which
is based on GBK.

The fonts I have don't cover most of GBK, but do cover nearly all of 
GB2312.  

>   if (covers_almost_all_of (GB2312))
>       font supports SIMPLIFIED Chinese
>   if (covers_almost_all_of (Big5))
>       font supports traditional Chinese

Thanks, this works just fine.  I'm much happier with this solution.
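
For readers who want to try the two independent tests, the pseudocode
above can be sketched in Python roughly as follows. This is only an
illustration, not the code committed to XFree86 CVS: the repertoires come
from Python's gb2312 and big5 codecs, a "font" is just a set of
characters, the zh-CN/zh-TW labels stand in for the lang values, and the
500-codepoint tolerance is the figure mentioned elsewhere in this thread.

```python
from functools import lru_cache

HAN_TOLERANCE = 500  # missing codepoints allowed for Han languages (from the thread)


@lru_cache(maxsize=None)
def repertoire(codec):
    """Every character a legacy codec decodes from a two-byte sequence."""
    chars = set()
    for lead in range(0x81, 0xFF):
        for trail in range(0x40, 0xFF):
            try:
                text = bytes((lead, trail)).decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this encoding
            if len(text) == 1:
                chars.add(text)
    return frozenset(chars)


def covers_almost_all_of(font_chars, codec, tolerance=HAN_TOLERANCE):
    """True when at most `tolerance` characters of the encoding are missing."""
    return len(repertoire(codec) - font_chars) <= tolerance


def chinese_langs(font_chars):
    """Apply the two tests independently; neither result excludes the other."""
    langs = []
    if covers_almost_all_of(font_chars, "gb2312"):
        langs.append("zh-CN")
    if covers_almost_all_of(font_chars, "big5"):
        langs.append("zh-TW")
    return langs


# Hypothetical fonts: one covering both repertoires, one covering GB2312 only.
print(chinese_langs(repertoire("gb2312") | repertoire("big5")))
print(chinese_langs(repertoire("gb2312")))
```

Because the two checks are independent, a font with GB18030-style
coverage is tagged for both languages instead of being excluded from
simplified Chinese.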

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab





Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Jungshik Shin

On Sat, 29 Jun 2002, Yao Zhang wrote:

> It should be
>
>   if (covers_almost_all_of (GB2312))
>       font supports SIMPLIFIED Chinese
>   if (covers_almost_all_of (Big5))
>       font supports traditional Chinese

  After sending my prev. message, I read this and I have to
agree with this. This is better than what I sent earlier.  Just forgetting
about GB18030/GBK coverage and concentrating on GB2312 and Big5 coverage
is simpler as well as better.

  Jungshik Shin




[Fonts]Han unification(SC and TC)(was..Re: Automatic 'lang' determination)

2002-06-29 Thread Jungshik Shin

On Sat, 29 Jun 2002, Keith Packard wrote:

Ooops. My message crossed yours in mail :-)

> Around 9 o'clock on Jun 29, Jungshik Shin wrote:

> > IMHO, most problems with Han Unification arise not from using a _single_
> > font targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in
> > another but from mixing _multiple_ fonts (with _drastically different_
> > design principle and other differences like baseline) to render a single
...

> Yes, I agree -- this is true in Western languages as well where the


  We agree with each other on this point, but still get to different
conclusions about zh-CN and zh-TW. I'm afraid that's because you have
been misinformed about what Han unification has done about simplified
forms and traditional forms of Chinese characters.


> > Suppose there's a document tagged as zh_TW that explains how PRC government
> > simplified Chinese characters to boost the literacy rate after WW II. If a
> > Big5 font (that doesn't cover all characters in the doc) is selected
> > instead of a GBK/GB18030 font (with the full coverage), simplified Han
> > characters(not used in Taiwan but only used in PRC) in the doc have to be
> > rendered with another font (most likely GB2312/GBK/GB18030 font).
>
> A correct version of this document would tag individual sections of the
> document with appropriate tags.  This way, the zh_TW sections could be
> presented in a traditional Chinese font while the mainland portions are
> displayed with simplified Chinese glyphs.

  Well, even without language tagging, that would happen, which
I regard as _ugly_ for the reason I gave in my previous message.
Language tag or not, the result would be just as ugly as using TimesRoman
Latin-1 font for most characters with a couple of characters rendered with
Palatino Latin-2 font.  My hypothetical document would not have separate
sections for zh-TW and zh-CN, but rather occasional simplified forms of
Chinese characters (absent in Big5 fonts but present in GB2312/GBK/GB18030
fonts) would pop up among traditional forms of Chinese characters
(present in _both_ Big5 font and GBK/GB18030 fonts).

  IMHO, tagging the whole document as 'zh-TW' is perfectly valid
and rendering it with GBK/GB18030 (with the full coverage of characters
in the document) is better than mixing two fonts, one with Big5 coverage
and the other with GBK/GB18030 coverage. The latter would happen if you
exclude GBK/GB 18030 fonts for zh-TW text rendering.

  Tagging individual simplified forms of Chinese characters
with 'lang=zh-CN' in the sea of traditional forms of Chinese characters
would only lead to a less-desirable result than otherwise possible.


> >  I'm not sure what you meant by 'glyph forms are more likely
> > simplified'. You might have misunderstood some aspects of Han Unification
> > in Unicode/10646.  In Unicode, simplified forms of Chinese characters are
> > NOT unified with corresponding traditional forms of Chinese characters.
>
> You're right -- I didn't believe this to be the case.  I had heard that the
> unified portion within the BMP do co-mingle simplified and traditional
> forms, but that the non-BMP Han extension provide separate codepoints for
> each.

  I'm afraid what you have heard of BMP section is misleading if
I understood you correctly. Whether in BMP or not, simplified forms of
Chinese characters are NOT UNIFIED with traditional forms of Chinese
characters. (let me copy my message to John H. Jenkins @Apple who knows a
lot more about Han Unification than I do.)  AFAIK, most complaints about
Han unification do NOT come from zh-CN vs zh-TW BUT from zh-CN/zh-TW
vs ja. For Han characters common in both zh-CN and zh-TW, there's no
significant difference in appearance between zh-CN and zh-TW. Although
many Japanese would not agree with me, I don't think there's any
significant difference across CJKV.  (again, ISO 10646 Han chart is a
good reference along with ROC MOE's Han character variant dictionary at
http://140.111.1.40) To me, Han Unification should have gone further (not
less) in a sense and it's worrisome to me that non-BMP includes too many
glyph variants (a whole bunch of them coming from Korean Buddhist text :
see http://www.sutra.re.kr)  that should have been unified in my eyes.

> If even BMP codepoints are separate,
> then it should be possible to create
> a large set of codepoints which could mark fonts as suitable for the
> display of simplified Chinese which are distinct from the set of
> codepoints suitable for the display of traditional Chinese.   That would
> be nicer than my current kludge of marking any font suitable for
> traditional chinese as unsuitable for simplified Chinese.

How about this?

   if covers most of GB 18030
      good for both zh-CN and zh-TW
      (and possibly good for ko)
   elif covers most of GBK
      good for both zh-CN and zh-TW
      (and possibly good for ko)
      not good for ja
   elif covers most of Big5
      good for zh-TW
      (and possibly good for ko)
  

Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yao Zhang

Keith Packard wrote:

> Actually, I could really use as many Han fonts as you have, especially if
> they are from different vendors and of different ages.  All I really need
> is the fonts.cache files generated from these fonts; that holds the unicode
> coverage and any OS/2 table information.  That would be a lot smaller, and
> also avoid any copyright or trade secret problems.

Sure, I will install as many Chinese fonts as possible and get the
fonts.cache for you.  But before that, I will show you several lines in my
fonts.cache:

"/usr/share/fonts/zh_CN/TrueType/zysong.ttf" 0 1017360509 
"ZYSong18030:style=regular:slant=0:weight=100:index=0:outline=True:scalable=True:charset=:lang=simplifiedchinese"
"/usr/share/fonts/zh_CN/TrueType/SimSun18030.ttc" 0 1021954464 
"SimSun\\-18030:style=regular:slant=0:weight=100:spacing=100:index=0:outline=True:scalable=True:charset=
[the encoded charset value that follows was corrupted by line wrapping in transit and is truncated in the archive]

Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Pablo Saratxaga

Kaixo!

On Sat, Jun 29, 2002 at 09:34:43AM -0700, Keith Packard wrote:
 
> This goal is reflected in the design I outlined -- fonts are deemed 
> "suitable" for a particular language when they cover a significant 
> fraction of the codepoints commonly associated with that language.

That is unacceptable.
A font is suited for a given language when it covers *ALL* of the codepoints
needed for that language.

The only exception in checking *all* of the needed codepoints is that
of CJK languages, that is because:
- there is a very small set of such languages
- the fonts are designed with coverage of one of them in mind
- the mandatory glyphs needed for a given CJK language that don't
  overlap with any other CJK language make quite a big set, allowing
  us to test just a small, carefully chosen set of glyphs and assume
  that all other glyphs needed for that CJK language are present too.

Maybe scripts used for one and only one language can also be handled
without the need to check all the needed codepoints (but on the other hand
they always form a small number of codepoints, so checking them all is
not a problem).

But for the great majority of languages, which are not the only ones written 
with a given script, just checking coverage of a "significant fraction"
is not enough.

Take Spanish, for example: it needs the a-z letters plus áéíóúüñ (that is, aacute,
eacute, iacute, oacute, uacute, udiaeresis and ntilde).
If only one of these is missing then you cannot render a Spanish text
correctly. Even if the font covers 65 of the 66 chars (33 lowercase,
33 uppercase), it is still not suitable to properly render
Spanish text (it may go unnoticed if the text just happens not to
use the missing letter, but relying on chance is not very serious).

So, the tests for CJK languages and for other languages are clearly different:
only CJK languages can get by with testing only a "significant fraction";
for all other languages all chars must be tested.
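
To make the distinction concrete, here is a minimal sketch of the two kinds of
test. The function names, the idea of passing a font's coverage as a plain set
of characters, and the 0.9 threshold are all assumptions made for
illustration; none of this is the actual XFree86 code.

```python
# Hypothetical sketch: the names and the 0.9 threshold are illustrative only.
SPANISH_EXTRA = "áéíóúüñ"
SPANISH_LOWER = "abcdefghijklmnopqrstuvwxyz" + SPANISH_EXTRA
# 33 lowercase + 33 uppercase = 66 required characters.
SPANISH_REQUIRED = frozenset(SPANISH_LOWER + SPANISH_LOWER.upper())


def missing_chars(font_coverage, required):
    """Return the required characters the font does not cover, sorted."""
    return sorted(set(required) - set(font_coverage))


def suits_language(font_coverage, required):
    """Strict test: the font must cover ALL of the required characters."""
    return not missing_chars(font_coverage, required)


def suits_cjk_language(font_coverage, sample, threshold=0.9):
    """Relaxed test: the font must cover a large fraction of a sample set."""
    sample = set(sample)
    if not sample:
        return False
    covered = len(sample & set(font_coverage))
    return covered / len(sample) >= threshold


# A font with everything Spanish needs except ntilde.
font = set(SPANISH_REQUIRED) - {"ñ"}
print(suits_language(font, SPANISH_REQUIRED))      # strict test fails
print(missing_chars(font, SPANISH_REQUIRED))       # shows what is missing
print(suits_cjk_language(font, SPANISH_REQUIRED))  # fraction test would pass
```

The same font passes the fraction test (65 of 66) and fails the strict one,
which is exactly the case described above.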
 
> > Suppose there's a document tagged as zh_TW that explains how PRC government
> > simplified Chinese characters to boost the literacy rate after WW II. If a
> > Big5 font (that doesn't cover all characters in the doc) is selected
> > instead of a GBK/GB18030 font (with the full coverage), simplified Han
> > characters(not used in Taiwan but only used in PRC) in the doc have to be
> > rendered with another font (most likely GB2312/GBK/GB18030 font).
> 
> A correct version of this document would tag individual sections of the
> document with appropriate tags.  This way, the zh_TW sections could be
> presented in a traditional Chinese font while the mainland portions are
> displayed with simplified Chinese glyphs.

Indeed.

I wonder however how place names are handled. Are there place names with
names using hanzi that don't exist in simplified form ?
If so, what would be the preferred solution to write such a place name
in a simplified Chinese text ?
Same question for people names.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.stben.be/pablo/   PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
___
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yao Zhang

I wrote earlier:
 
> Actually, it is better changed to
>   if (covers_almost_all_of (GB2312))
>   font supports traditional Chinese
>   if (covers_almost_all_of (Big5))
>   font supports traditional Chinese
 
It should be

if (covers_almost_all_of (GB2312))
font supports SIMPLIFIED Chinese
if (covers_almost_all_of (Big5))
font supports traditional Chinese
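
As a rough, runnable sketch of these two independent tests: the character
sets below are built from Python's own "gb2312" and "big5" codecs, and the
0.95 meaning of "almost all" is an assumed value, so treat this as an
illustration rather than the fontconfig implementation.

```python
# Hypothetical sketch: the threshold and helper names are assumptions.
ALMOST_ALL = 0.95


def charset_codepoints(codec, lead_range, trail_range):
    """Collect the characters a two-byte legacy encoding can decode."""
    chars = set()
    for lead in lead_range:
        for trail in trail_range:
            try:
                chars.add(bytes([lead, trail]).decode(codec))
            except UnicodeDecodeError:
                pass  # unassigned byte pair in this encoding
    return chars


GB2312 = charset_codepoints("gb2312", range(0xA1, 0xF8), range(0xA1, 0xFF))
BIG5 = charset_codepoints(
    "big5",
    range(0xA1, 0xFA),
    list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF)),
)


def covers_almost_all_of(font_coverage, charset):
    """True when the font covers at least ALMOST_ALL of the charset."""
    return len(font_coverage & charset) / len(charset) >= ALMOST_ALL


def chinese_langs(font_coverage):
    """Apply the two tests independently; neither excludes the other."""
    langs = set()
    if covers_almost_all_of(font_coverage, GB2312):
        langs.add("simplified Chinese")
    if covers_almost_all_of(font_coverage, BIG5):
        langs.add("traditional Chinese")
    return langs


# A font covering both sets (as a GBK/GB18030 font roughly does) gets both tags.
print(chinese_langs(GB2312 | BIG5))
```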

Sorry about the typo.



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yao Zhang

From: Keith Packard <[EMAIL PROTECTED]>

> Around 22 o'clock on Jun 29, Yu Shao wrote:

> > >Tagging GB18030 fonts as suitable for traditional chinese seems like a 
> > >mistake; the glyph forms are more likely simplified, and it would be 
> > >
> > Agreed.

> This is reassuring.

No, this is not the case.

Let us use Unicode terms here, because those national standards are
misleading.  GB18030 is a PRC standard, but that doesn't mean it
is for simplified Chinese.  Actually, all those fonts use a Unicode
CMAP, so they are really Unicode fonts.

For Han characters, GB18030 covers CJK Unified Ideographs and
its extension A.  GBK covers CJK Unified Ideographs only.  Roughly
speaking, CJK Unified Ideographs covers both the GB2312 and BIG5 character
sets.  The simplified and traditional forms are NOT unified.  So
both GBK and GB18030 fonts are suitable for simplified Chinese
and traditional Chinese.

No, the algorithm is not quite right:

if (covers_much_of (gb18030))
font supports simplified Chinese
if (covers_almost_all_of (Big5))
font supports traditional Chinese
font does not support simplified Chinese

For a GB18030 font, since it covers much of the GB18030 set, it supports
simplified Chinese.  And it also covers almost all of BIG5, so it
supports traditional Chinese too.  But now the algorithm excludes it
from simplified Chinese support.  The last line is wrong.

Actually, it is better changed to
if (covers_almost_all_of (GB2312))
font supports traditional Chinese
if (covers_almost_all_of (Big5))
font supports traditional Chinese

Regards,

Yao Zhang



Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Jungshik Shin




On Sat, 29 Jun 2002, Jungshik Shin wrote:
> On Fri, 28 Jun 2002, Keith Packard wrote:

> > I'm confused by this; my exposure to Chinese fonts says that simplified
> > Chinese and traditional Chinese have significant overlap in Unicode
> > codepoints, but that the glyphs are quite a bit different in appearance.
>
>   I doubt this is the case. As far as I can tell

  I found this needs some clarification.  If glyphs of 'A', 'B'
and 'C' from a Times Roman Latin-1 font are compared with the corresponding
glyphs from a New Century Schoolbook Latin-2 font, they certainly look
different. However, that does not mean that you cannot use the Times Roman
Latin-1 font to render a run of text in one of the languages Latin-2 is meant
for, as long as the Times Roman Latin-1 font has _all_ the glyphs necessary in
that particular run of text.

  I believe the same thing can happen between two fonts for
zh-TW and zh-CN. If glyphs from font A for zh-TW are compared with glyphs
from font B (with different design principles) for zh-CN, they for sure
look different. However, they're different not because font A is for zh-TW
and font B is for zh-CN but because they're designed to appear different.

> > Chinese and traditional Chinese have significant overlap in Unicode
> > codepoints, but that the glyphs are quite a bit different in appearance.

  To make this kind of comparison meaningful, you have to compare
two fonts, one for zh-TW and the other for zh-CN, made by a _single_
foundry with the _identical_ design principles and look and feel
(something like Adobe Times Roman Latin-1 font and Adobe Times Roman
Latin-2 font).

  In practice, it's hard to find two fonts that satisfy the criteria I
outlined here.  However, ISO 10646 code charts for Han characters should
do almost as good a job.  That's why I suggested comparing glyphs for
PRC and Taiwan in the ISO 10646 Han character chart.

   Jungshik Shin




Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard


Around 9 o'clock on Jun 29, Jungshik Shin wrote:

> IMHO, most problems with Han Unification arise not from using a _single_
> font targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in
> another but from mixing _multiple_ fonts (with _drastically different_
> design principle and other differences like baseline) to render a single
> run of text (say, 65% of characters drawn from one font, 25% from a second
> font, 7% from a third font, etc).

Yes, I agree -- this is true in Western languages as well, where the 
application selects a font covering only Latin-1 and attempts to display 
text requiring glyphs from Latin-2; a "smart" application will locate an 
additional font to fill in the missing glyphs, and the result looks like a 
ransom note.

The hope is that proper language tags in the document can avoid this at 
the start by making the first font contain the proper coverage for the 
entire block of text.

This goal is reflected in the design I outlined -- fonts are deemed 
"suitable" for a particular language when they cover a significant 
fraction of the codepoints commonly associated with that language.

> Suppose there's a document tagged as zh_TW that explains how PRC government
> simplified Chinese characters to boost the literacy rate after WW II. If a
> Big5 font (that doesn't cover all characters in the doc) is selected
> instead of a GBK/GB18030 font (with the full coverage), simplified Han
> characters(not used in Taiwan but only used in PRC) in the doc have to be
> rendered with another font (most likely GB2312/GBK/GB18030 font).

A correct version of this document would tag individual sections of the
document with appropriate tags.  This way, the zh_TW sections could be
presented in a traditional Chinese font while the mainland portions are
displayed with simplified Chinese glyphs.

I don't know how prevalent language tagging is in office document formats, 
but it's certainly available in HTML.  It's the HTML case that started my 
journey into language tags.

>  I'm not sure what you meant by 'glyph forms are more likely
> simplified'. You might have misunderstood some aspects of Han Unification
> in Unicode/10646.  In Unicode, simplified forms of Chinese characters are
> NOT unified with corresponding traditional forms of Chinese characters.

You're right -- I didn't believe this to be the case.  I had heard that the
unified portion within the BMP does co-mingle simplified and traditional
forms, but that the non-BMP Han extensions provide separate codepoints for
each.

If even BMP codepoints are separate, then it should be possible to create 
a large set of codepoints which could mark fonts as suitable for the 
display of simplified Chinese, distinct from the set of 
codepoints suitable for the display of traditional Chinese.   That would 
be nicer than my current kludge of marking any font suitable for 
traditional Chinese as unsuitable for simplified Chinese.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab





Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Keith Packard


Around 22 o'clock on Jun 29, Yu Shao wrote:

> >Tagging GB18030 fonts as suitable for traditional chinese seems like a 
> >mistake; the glyph forms are more likely simplified, and it would be 
> >
> Agreed.

This is reassuring.

> As gb18030 is compulsory from government, I think we should just treat 
> gb18030 as Simplified Chinese, and all fonts from now on should be gb18030 
> compliant. For these fonts, the newly included Chinese minority Yi and 
> Tibetan characters would do.

The trick that I use to distinguish between simplified Chinese and
traditional Chinese targeted fonts is not whether they cover a significant
fraction of the Unicode codepoints mapped from gb18030, but whether they
cover nearly all of the Unicode codepoints mapped from Big5.

The algorithm looks like:

if (covers_much_of (gb18030))
font supports simplified Chinese
if (covers_almost_all_of (Big5))
font supports traditional Chinese
font does not support simplified Chinese
if (covers_almost_all_of (JIS))
font supports Japanese
font does not support simplified Chinese
if (covers_almost_all_of (Korean Wansung))
font supports Korean
font does not support simplified Chinese
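
The same logic as a minimal runnable sketch. It assumes the per-encoding
coverage has already been computed as a fraction between 0 and 1; the
dictionary keys and the two thresholds (0.5 for "much of", 0.95 for "almost
all of") are made-up values for illustration, not the ones in the committed
code.

```python
# Hypothetical sketch: thresholds and encoding keys are assumptions.
MUCH_OF = 0.5
ALMOST_ALL_OF = 0.95

# Encodings whose near-complete coverage overrides the simplified Chinese tag.
EXCLUDING_ENCODINGS = (
    ("big5", "traditional Chinese"),
    ("jis", "Japanese"),
    ("korean_wansung", "Korean"),
)


def guess_langs(coverage):
    """coverage maps an encoding name to the fraction of it the font covers."""
    langs = set()
    if coverage.get("gb18030", 0.0) >= MUCH_OF:
        langs.add("simplified Chinese")
    for encoding, lang in EXCLUDING_ENCODINGS:
        if coverage.get(encoding, 0.0) >= ALMOST_ALL_OF:
            langs.add(lang)
            # The kludge: any other Han language rules out simplified Chinese.
            langs.discard("simplified Chinese")
    return langs


print(guess_langs({"gb18030": 0.6, "big5": 0.2}))   # simplified only
print(guess_langs({"gb18030": 0.6, "big5": 0.99}))  # traditional only
```

Note that a font covering every encoding nearly completely ends up tagged for
traditional Chinese, Japanese and Korean but not for simplified Chinese.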

Nearly all Han fonts cover as much of GB18030 as those targeted for 
simplified Chinese, but (in my limited sample) simplified Chinese fonts 
cover only a small fraction of all of the other Han encodings.  Except for
Arial Unicode, which covers all of the encodings nearly completely.  

Remember that this whole mess is only needed for fonts which don't have 
any OS/2 codePageRange bits set; the hope is that new fonts covering more 
of the Unicode range will be provided in TrueType or OpenType format so 
that this particular hack can be avoided.

> But the very popular Microsoft's Chinese simsun font now, is actually a 
> gbk font.

This is a TrueType font and so the above hacks don't apply.  Are there new 
GB18030 fonts being distributed in formats that don't include the OS/2 
codePageRange bits?

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab





Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Jungshik Shin




On Fri, 28 Jun 2002, Keith Packard wrote:

>
> Around 0 o'clock on Jun 29, Yao Zhang wrote:
>
> > A GB18030 font (covers CJK Unified Ideographs and its extension A in Unicode
> > terms) should really be labeled as
> > Simplified Chinese AND Traditional Chinese
> > while fonts with GB2312 coverage should be labeled as
> > Simplified Chinese
> > and BIG5 coverage should be labeled as
> > Traditional Chinese
>
> I'm confused by this; my exposure to Chinese fonts says that simplified
> Chinese and traditional Chinese have significant overlap in Unicode
> codepoints, but that the glyphs are quite a bit different in appearance.

  I doubt this is the case. As far as I can tell
from ISO 10646 (unlike Unicode, for a single Han character, ISO
10646 lists glyphs as _commonly_ used in PRC, Taiwan, Japan, ROK, and
Vietnam. ISO 10646:2 also lists DPRK glyphs),  characters common in
GB2312(SC) and Big5(TC)  do not have big enough difference (if there's
any difference at all) in glyphs to make using a _single_ font(say,
GB18030/GBK fonts) for both zh_CN and zh_TW undesirable. IMHO, most
problems with Han Unification arise not from using a _single_ font
targeted at one of zh_TW/zh_CN/ja/ko to render a run of text in another
but from mixing _multiple_ fonts (with _drastically different_ design
principle and other differences like baseline) to render a single run
of text (say, 65% of characters drawn from one font, 25% from a second
font, 7% from a third font, etc). I'm not saying there's no problem at
all using a TC font for Japanese text rendering. I'm well aware that
many Japanese don't like that. However, using GBK/GB18030 fonts for TC
should present much much less problem than that.

> I'm not interested in discovering which fonts can display a particular
> document; that's easily done with Unicode coverage.  What I'm interested in
> is selecting the font best suited for presenting data tagged for a
> particular language.

  I believe Yao's well aware of your interest here. What he meant is
that using GBK/GB18030 fonts for both SC and TC rendering is all right. It
could be even desirable in some cases. Suppose there's a document tagged
as zh_TW that explains how PRC government simplified Chinese characters to
boost the literacy rate after WW II. If a Big5 font (that doesn't cover
all characters in the doc) is selected instead of a GBK/GB18030 font
(with the full coverage), simplified Han characters(not used in Taiwan
but only used in PRC) in the doc have to be rendered with another font
(most likely GB2312/GBK/GB18030 font).  Even though font selection
routine does a pretty good job of picking two fonts(Big5 font and
GB2312/GBK/GB18030) with similar look and feel, there may be a subtle
but noticeable difference between the two. If a GBK/GB18030 font is used to
render _all_ Han characters in the doc., this wouldn't be an issue and
the result would give a uniform and consistent look and feel.

> Tagging GB18030 fonts as suitable for traditional chinese seems like a
> mistake; the glyph forms are more likely simplified, and it would be
> preferable to use a traditional chinese font, if any is available.  Of

  I'm not sure what you meant by 'glyph forms are more likely
simplified'. You might have misunderstood some aspects of Han Unification
in Unicode/10646.  In Unicode, simplified forms of Chinese characters are
NOT unified with corresponding traditional forms of Chinese characters.
If GB2312 and Big5 have some characters in common, that's because PRC
didn't simplify them and just decided to use traditional forms.

  Jungshik Shin




Re: [Fonts]Automatic 'lang' determination

2002-06-29 Thread Yu Shao

Keith Packard wrote:

>Around 0 o'clock on Jun 29, Yao Zhang wrote:
>
>>A GB18030 font (covers CJK Unified Ideographs and its extension A in Unicode
>>terms) should really be labeled as
>>Simplified Chinese AND Traditional Chinese
>>while fonts with GB2312 coverage should be labeled as
>>Simplified Chinese
>>and BIG5 coverage should be labeled as
>>Traditional Chinese
>>
>
>I'm confused by this; my exposure to Chinese fonts says that simplified 
>Chinese and traditional Chinese have significant overlap in Unicode 
>codepoints, but that the glyphs are quite a bit different in appearance.
>
>I'm not interested in discovering which fonts can display a particular
>document; that's easily done with Unicode coverage.  What I'm interested in
>is selecting the font best suited for presenting data tagged for a
>particular language.
>
>Tagging GB18030 fonts as suitable for traditional chinese seems like a 
>mistake; the glyph forms are more likely simplified, and it would be 
>
Agreed.

>
>preferable to use a traditional chinese font, if any is available.  Of 
>course, when no traditional chinese font is present, the system will 
>search for *any* font which does cover those codepoints, substituting in 
>an available simplified chinese font.
>
>I believe I've found a relatively robust way of distinguishing fonts 
>designed for traditional chinese from those designed for simplified 
>chinese; traditional chinese fonts cover most of BIG5 while simplified
>chinese fonts don't.  Both cover similar amounts of GB18030; as you say, 
>that encoding is enormous.  
>
>What I didn't investigate is whether the simplified chinese fonts cover
>*different* parts of GB18030 than the traditional fonts.  That might make
>
As gb18030 is compulsory from government, I think we should just treat 
gb18030 as Simplified Chinese, and all fonts from now on should be gb18030 
compliant. For these fonts, the newly included Chinese minority Yi and 
Tibetan characters would do.

But the very popular Microsoft's Chinese simsun font now, is actually a 
gbk font.

>
>the determination easier; simply use the subset of GB18030 normally needed
>to present simplified chinese documents as the touchstone instead of the
>whole encoding.  For that to work, I'd need a lot more simplified chinese
>fonts from various vendors.
>
>>If you need those fonts for testing, I will send you one typical font
>>in each category (They are huge, at least several MB in size).  For
>>example,
>>
>
>Actually, I could really use as many Han fonts as you have, especially if
>they are from different vendors and of different ages.  All I really need
>is the fonts.cache files generated from these fonts; that holds the unicode
>coverage and any OS/2 table information.  That would be a lot smaller, and
>also avoid any copyright or trade secret problems.
>
>Keith Packard        XFree86 Core Team        HP Cambridge Research Lab
>
>
>






Re: [Fonts]FreeType 2 backend for the masses

2002-06-29 Thread Juliusz Chroboczek

JP> [...] either the instructions or the sources should be adapted.

I was only checking if anyone's paying attention ;-)

Nice to see you back, Joerg.

Juliusz



[Fonts]i18n fixes in Xlib, A.1141 improvement

2002-06-29 Thread Olivier Chapuis

Hello,

The attached patch improves my submission to <[EMAIL PROTECTED]> with
sequence number A.1141 (and subject:)

Compared with A.1141, the new code adds one line of code plus some comments
that answer a question from O. Taylor about adding a "break" in the code.
I explain why this "break" should be added (moreover, it fixes a
memory leak and fixes slow font loading for a base font name with
a long list of fonts).

I cc this mail to [EMAIL PROTECTED] as I think that some parts
of the Xlib i18n code need a real clean-up (the patch contains
only fixes). I would like to know if someone maintains or works on
this part of the Xlib code and whether I should start this clean-up.
Probably another mailing list should be used, but as I am not
on the XFree86 team (this is the first fix I have sent) I cc this mailing list.

Here is a full change log:

* xc/lib/X11/omGeneric.c (destroy_fontdata):
Free an XFontStruct that should have been freed but was not

* xc/lib/X11/omGeneric.c (parse_vw):
(parse_fontname):
Fixed minor memory leaks

* xc/lib/X11/omGeneric.c (parse_fontname):
break when a match is found

* xc/lib/X11/lcFile.c (_XlcLocaleDirName):
Fixed minor memory leaks

Regards, Olivier

PS: - patch done in xc/lib/X11 with cvs diff -u
- Do not be afraid my C is better than my English
- Please apply the patch it fixes quite dramatic bugs IMHO

Index: lcFile.c
===
RCS file: /cvs/xc/lib/X11/lcFile.c,v
retrieving revision 3.26
diff -u -r3.26 lcFile.c
--- lcFile.c2002/05/31 18:45:42 3.26
+++ lcFile.c2002/06/29 07:55:43
@@ -421,12 +421,18 @@
   sprintf(buf, "%s/locale.dir", target_dir);
   target_name = resolve_name(name, buf, RtoL);
 }
+if (name != NULL && name != lc_name) {
+   XFree(name);
+   name = NULL;
+}
 if (target_name != NULL) {
   char *p = 0;
   if ((p = strstr(target_name, "/XLC_LOCALE"))) {
*p = '\0';
break;
   }
+  XFree(target_name);
+  target_name = NULL;
 }
   }
   if (target_name == NULL) {
@@ -437,5 +443,8 @@
   strcpy(dir_name, target_dir);
   strcat(dir_name, "/");
   strcat(dir_name, target_name);
+  if (target_name != lc_name) {
+  XFree(target_name);
+  }
   return dir_name;
 }
Index: omGeneric.c
===
RCS file: /cvs/xc/lib/X11/omGeneric.c,v
retrieving revision 3.20
diff -u -r3.20 omGeneric.c
--- omGeneric.c 2001/04/05 17:42:26 3.20
+++ omGeneric.c 2002/06/29 07:55:48
@@ -1056,6 +1056,22 @@
 *
 * Owen Taylor <[EMAIL PROTECTED]> 12 Jul 2000
 */
+   /* The reason why this routine modifies font_data and has a
+* font_data_return is that if it is called with C_PRIMARY, then
+* font_data_return is used by the caller and with the others classes
+* font_data is used by the caller (font_data can be different
+* than font_data_return if we do not break here).
+* However, a close look at the code (e.g., the drawing funcs) shows
+* that breaking or not here change nothing!! 
+* So we should 'break' here and the code needs a clean-up (e.g.,
+* some FontStruct are loaded and _never_ used).
+* Hopefully this also fix a memory leak: if we do not break here
+* a found a match later font_data->xlfd_name is deferenced without
+* being freed. Finally, this speed up font loading.
+*
+* <[EMAIL PROTECTED]> 2002-06-29
+*/
+   break;
}
 
switch(class) {
@@ -1126,13 +1142,21 @@
 intret = 0, i = 0;
 
 if(vmap_num > 0) {
-   if(parse_fontdata(oc, font_set, vmap, vmap_num, name_list, count, C_VMAP) == -1)
+   if(parse_fontdata(oc, font_set, vmap, vmap_num, name_list, count,
+ C_VMAP, &font_data_return) == -1) {
+   if(font_data_return.xlfd_name != NULL)
+XFree(font_data_return.xlfd_name);
return (-1);
+   }
+   if(font_data_return.xlfd_name != NULL)
+   XFree(font_data_return.xlfd_name);
 }
 
 if(vrotate_num > 0) {
ret = parse_fontdata(oc, font_set, (FontData) vrotate, vrotate_num,
 name_list, count, C_VROTATE, &font_data_return);
+   if(font_data_return.xlfd_name != NULL)
+   XFree(font_data_return.xlfd_name);
if(ret == -1) {
return (-1);
} else if(ret == False) {
@@ -1168,6 +1192,8 @@
 
ret = parse_fontdata(oc, font_set, (FontData) vrotate, vrotate_num,
 name_list, count, C_VROTATE, &font_data_return);
+   if(font_data_return.xlfd_name != NULL)
+   XFree(font_data_return.xlfd_name);
if(ret == -1)
return (-1);
}
@@ -1237,6 +1263,7 @@
font_set->side = font_data_return.side;
 
 Xfree (font_data_