--- Keith Packard <[EMAIL PROTECTED]> wrote:
>
> Around 4 o'clock on May 30, Andrew Dunbar wrote:
>
> > > - The set of languages in the OS/2 table / FC_LANG
> > >   is pitifully
> >
> > Can't you use coverage to determine this?
>
> Not easily.  Traditional Chinese, simplified Chinese, Japanese and Korean
> fonts cover the same Unicode regions, and fonts for all of these languages
> generally cover only a fraction of the total space making any coverage
> based language tag only a guess at best.  In particular, we'd need to call
> upon an expert in the area of the two Chinese variants to get an idea if
> there were any codepoints distinguishing the two.

Well, yes and no.  Korean uses the Traditional Chinese style, so it's safer
to mix those two.  Simplified and traditional use mostly similar styles, and
I'm not aware of any codepoint that needs to be rendered differently for
each language, as the two versions have separate codepoints.  But mixing a
Japanese and a Chinese/Korean style generally offends somebody.

The usual example is U+6D77, which has a different stroke count for Japanese
vs the others.  Stroke count is an important property in CJK because it is
used for looking characters up in dictionaries, and people are sensitive to
it.  You can see that the Chinese versions have two "dots" in the middle of
the grid whereas the Japanese version has a vertical bar.  Apparently this
is the kind of thing the Japanese dislike about Unicode:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=6d77

Also, Japanese readers generally seem to prefer a "sans serif" look while
Chinese readers prefer a more "calligraphic" or "serifed" look, and these
don't mix and match well.

In Japan there is a set of 1,850 characters that everybody has to know by
the end of high school.  There's probably an equivalent for Chinese.
Somebody knowledgeable could probably build a table useful for heuristics,
or you could do a frequency count using web pages and make a table from
that.  Anyway, I think an educated guess at low computational cost would be
better than nothing.

> > For now yes.  Romanian uses a "comma below" on some letters which
> > Unicode has mapped onto a cedilla.
>
> This is a minor issue by comparison, but the same basic problem.  We'll
> see if people using that language start to rise up in revolt as the Han
> language groups have, then we can start looking for yet another kludge.

I investigated further last night and it seems these characters have been
given separate codepoints after all.
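
Going back to the frequency-table idea above, here is a rough sketch of what
I have in mind, in Python only for brevity.  The per-language character lists
are tiny hand-picked samples that I made up for illustration, not real
frequency data, and the function name `guess_languages` and the 0.95
threshold are my own assumptions; nothing here is existing fontconfig code.
The idea is simply to count how many of a language's most common characters
a font covers and to tag the font with that language only if nearly all of
them are present.

```python
# Hypothetical sample tables: a handful of very common characters per
# language.  A real table would come from a frequency count over a corpus
# (web pages, for instance) and would be far larger.
COMMON_CHARS = {
    "zh-cn": set("的一是不了人我在有他这们说"),
    "zh-tw": set("的一是不了人我在有他這們說"),
    "ja": set("のにはをたがでてとし日本人"),
    "ko": set("이다는에을의가한하고"),
}


def guess_languages(coverage, tables=COMMON_CHARS, threshold=0.95):
    """Guess which languages a font is usable for.

    coverage  -- set of integer codepoints the font has glyphs for
    tables    -- mapping of language tag to a set of common characters
    threshold -- minimum fraction of a table that must be covered

    Returns a list of (language, score) pairs, best score first.
    """
    results = []
    for lang, chars in tables.items():
        covered = sum(1 for ch in chars if ord(ch) in coverage)
        score = covered / len(chars)
        if score >= threshold:
            results.append((lang, score))
    # Highest score first; ties are broken by language tag for stable output.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


# Example: a made-up font that only has the simplified forms.
simplified_font = {ord(ch) for ch in "的一是不了人我在有他这们说"}
print(guess_languages(simplified_font))
```

The cost is one set lookup per table entry, so it stays cheap even with a few
thousand characters per language.  It is still only a guess: a font that
covers both simplified and traditional forms would be tagged with both, and
it says nothing about which glyph style the font was drawn in.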

I'm pretty sure there will be new cases in the Indic ranges, where Unicode
recommends using codepoints from Devanagari for various symbols in the other
scripts, but these are hardly used yet.

Andrew Dunbar.

> Keith Packard        XFree86 Core Team        HP Cambridge Research Lab

=====
http://linguaphile.sourceforge.net
http://www.abisource.com

_______________________________________________
Fonts mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/fonts