I've decided that using RFC 3066 to indicate font language coverage is a good idea (or at least the best idea). Owen Taylor is partially responsible as he's using it in Pango; the realization that HTML also uses this RFC for language tagging documents makes it pretty clear that I could do a lot worse.
RFC 3066 uses ISO 639 language codes and combines them with ISO 3166 country codes -- you're probably familiar with these as a part of locale names (e.g. en-US). My plan is to have fonts advertise the complete set of languages that they cover, and then to allow them to further distinguish languages with country codes as needed (zh-TW vs zh-CN).

Now matching can take place using the language tags; a font supporting the language for a different country will match "less strongly" than a font matching the language for the correct country. Both of these will match more strongly than a font not supporting the language at all. This has the benefit of making traditional Chinese fonts preferred over Japanese fonts for the display of simplified Chinese documents. I think this will work better than the current hack using OS/2 codePageRange bits.

Ok, so now I have a direction to run in, but I'm missing a ton of data. To generate language coverage for a font, I need to know what Unicode coverage is required for each language. I don't want the coverage offered by fonts designed for the language; that's often far broader than the coverage needed to display text in the language. All I want are the Unicode codepoints for the alphabet, abjad or logography; that way fonts are selected strictly on language coverage, ignoring the spurious punctuation and foreign characters common in encodings.

My plan is to start with the 139 two-letter language codes from ISO 639-1 and add (as needed) three-letter language codes from ISO 639-2. That still leaves me needing coverage information for 139 languages. I've managed to scrounge coverage information for most European languages along with the non-European 8859 languages and the Han languages. That's a total of 61 languages covering a great deal of the world. Big parts are still missing: most of non-Arabic Africa, non-Han Asia and a smattering of Native American languages.
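
To make the three-level matching described above concrete, here is a rough sketch in Python. It is not code from Pango or from the X font code; the names split_tag and match_strength and the numeric scores are made up for illustration. It also assumes that a font tag with no country code counts only as a language-level match when the request names a country, which the plan above does not actually specify.

```python
from typing import Iterable, Optional, Tuple


def split_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag such as 'zh-TW' into ('zh', 'tw').

    RFC 3066 tags compare case-insensitively, so everything is lowercased.
    The country part is None when the tag is just a language code.
    """
    parts = tag.lower().split("-")
    country = parts[1] if len(parts) > 1 else None
    return parts[0], country


def match_strength(wanted: str, font_tags: Iterable[str]) -> int:
    """Hypothetical scoring: 2 = language and country match,
    1 = language matches but the country does not, 0 = language not covered.
    """
    want_lang, want_country = split_tag(wanted)
    best = 0
    for tag in font_tags:
        lang, country = split_tag(tag)
        if lang != want_lang:
            continue
        # Assumption: a request without a country accepts any country.
        if want_country is None or country == want_country:
            return 2
        best = 1
    return best


if __name__ == "__main__":
    # Made-up fonts advertising made-up coverage.
    fonts = {
        "japanese": ["ja"],
        "traditional-chinese": ["zh-TW"],
        "simplified-chinese": ["zh-CN"],
    }
    ranked = sorted(fonts, key=lambda f: match_strength("zh-CN", fonts[f]),
                    reverse=True)
    print(ranked)
```

With a zh-CN document, the simplified Chinese font scores highest, the traditional Chinese font comes next, and the Japanese font scores zero, which is the ordering the plan is after.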
If any of you have a particular interest in one of the missing languages, please feel free to build an appropriate coverage table. They're usually quite short and easy to generate as long as one has knowledge of, or a reference for, the source language. They should be as complete as possible; the goal is to avoid using fonts which are missing some common codepoints. One such example is attempting to use an ISO Latin-1 font for Turkish: Latin-1 has all but a handful of the codepoints needed to display Turkish, making it nearly complete but also completely unsuitable.

Here's an example which should make the format abundantly clear:

    # Dutch (NL)
    0040-005a
    0060-007a
    00c4
    00cb
    00cf
    00d6
    00dc
    00e4
    00eb
    00ef
    00f6
    00fc
    #0132-0133	# IJ and ij ligatures

I attach a listing of the ISO 639-1 language codes with a '*' marking the languages for which I have coverage information.

For those uninterested in the mechanics of generating the tables, please send references to places I can find coverage information for missing languages. I'm willing to take information in whatever format you have.

Keith Packard
XFree86 Core Team
HP Cambridge Research Lab

------

Lang Done Description
AA        Afar; Djibouti, N Ethiopia; Hamito-Semitic F., Cushitic Br.
AB   *    Abkhazian; Abkhazia (Georgia); Caucasian F.
AF        Afrikaans; South Africa, Namibia; Indo-European F., Germanic Br.; 10
AM        Amharic; Ethiopia; Hamito-Semitic F., Semitic Br.; 20
AR   *    Arabic; Middle East, N Africa; Hamito-Semitic F., Semitic Br.; 218
AS        Assamese; Assam (India); Indo-European F., Indo-Iranian Br.; 23
AY        Aymara; Bolivia, Peru; Andean-Equatorial F., Andean Br.; 2
AZ   *    Azerbaijani; Iran, Azerbaijan; Uralo-Altaic F., Turkic Br.; 15
BA   *    Bashkir; Bashkir (S Urals, Russia); Uralo-Altaic F., Turkic Br.; 1
BE   *    Byelorussian; Byelorussia; Indo-European F., Balto-Slavic Br.; 10
BG   *    Bulgarian; Bulgaria, Yugoslavia, Greece; Indo-European F., Balto-Slavic Br.; 9
BH        Bihari; Bihar (India); Indo-European F., Indo-Iranian Br.
BI        Bislama; Vanuatu, New Caledonia; English based creole, Pacific
BN        Bengali, Bangla; Bangladesh, West Bengal (India); Indo-European F., Indo-Iranian Br.; 196
BO        Tibetan; Tibet, Bhutan, Nepal, India; Sino-Tibetan F., Tibeto-Burmese Br.; 5; BO from Bodskad
BR   *    Breton; Brittany (W France); Indo-European F., Celtic Br.
CA   *    Catalan; Catalonia (NE Spain), Balearic Islands, Sardinia, S France, Andorra, Argentina; Indo-European F., Italic Br.; 9
CO   *    Corsican; Corsica (France); Indo-European F., Italic Br.
CS   *    Czech; Czech Republic; Indo-European F., Balto-Slavic Br.; 11
CY        Welsh; Wales (United Kingdom); Indo-European F., Celtic Br.
DA   *    Danish; Denmark, Germany; Indo-European F., Germanic Br.; 5
DE   *    German; Germany, Austria, Switzerland, U.S.A.; Indo-European F., Germanic Br.; 121; DE from Deutsch
DZ        Bhutani, Bhutanese; Bhutan; Sino-Tibetan F., Tibeto-Burmese Br.
EL   *    Greek; Greece, Cyprus, Turkey; Indo-European F., Hellenic Br.; 12
EN   *    English; North America, British Isles, Australia, New Zealand, South Africa; Indo-European F., Germanic Br.; 470
EO   *    Esperanto; 2; Artificial language
ES   *    Spanish; Spain, Latin America, U.S.A.; Indo-European F., Italic Br.; 381
ET   *    Estonian; Estonia; Uralo-Altaic F., Finno-Ugric Br.; 1
EU   *    Basque; W Pyrenees (France, Spain); (Isolate); EU from Euskera
FA        Persian; Iran, Afghanistan; Indo-European F., Indo-Iranian Br.; 35; FA from Farsi
FI   *    Finnish, Suomi; Finland, Russia, Sweden; Uralo-Altaic F., Finno-Ugric Br.; 6
FJ        Fiji, Fijian; Fiji; Austric F., Malayo-Polynesian Br.
FO   *    Faroese, Faeroese; Faeroe Islands (Denmark); Indo-European F., Germanic Br.
FR   *    French; France, Belgium, Canada, U.S.A., Switzerland; Indo-European F., Italic Br.; 124
FY   *    Frisian; Frisian Islands (Netherlands-Germany); Indo-European F., Germanic Br.
GA   *    Irish; Ireland; Indo-European F., Celtic Br.; GA from Gaeilge
GD   *    Scots Gaelic; Scotland; Indo-European F., Celtic Br.
GL   *    Galician; Spanish Galicia; Indo-European F., Italic Br.; 4
GN        Guaraní; Paraguay, Bolivia, S Brazil; Andean-Equatorial F., Equatorial Br.; 4
GU        Gujarati, Gujerati; Gujarat (India), Bombay, Pakistan, South Africa; Indo-European F., Indo-Iranian Br.; 40
HA        Hausa; N Nigeria, Niger, Cameroun; Hamito-Semitic F., Chadic Br.; 37
HE   *    Hebrew; Israel; Hamito-Semitic F., Semitic Br.; 5; Formerly IW from Iwrith. See Note 4.
HI        Hindi; India, Pakistan, Trinidad, Guyana, Fiji, Mauritius; Indo-European F., Indo-Iranian Br.; 418; Same as Urdu [UR] except for writing system. See Note 3.
HR   *    Croatian, Croat; Croatia; Indo-European F., Balto-Slavic Br.; HR from Hrvatski. See Note 2.
HU   *    Hungarian, Magyar; Hungary, Romania, Yugoslavia, Czechoslovakia; Uralo-Altaic F., Finno-Ugric Br.; 14
HY   *    Armenian; Armenia, Middle East; Indo-European F., Armenian Br.; 5; HY from Hayeren
IA        Interlingua; Artificial language
ID        Indonesian, Bahasa Indonesia; Indonesia, Malaysia, Thailand, Singapore, Brunei; Austric F., Malayo-Polynesian Br.; Formerly IN. See Note 4.
IE        Interlingue; Artificial language. Prototype of Interlingua [IA]
IK        Inupiak; Greenland, N Canada, Alaska (U.S.A.); Eskimo-Aleut F.
IS   *    Icelandic; Iceland; Indo-European F., Germanic Br.; IS from Islenzk
IT   *    Italian; Italy, U.S.A., France, Argentina, Switzerland, Canada, Brazil; Indo-European F., Italic Br.; 62
IU        Inuktitut; NE Canada; Eskimo-Aleut F.; See Note 5.
JA   *    Japanese, Nihongo; Japan, Brazil, California (U.S.A.), Hawaii (U.S.A.); Japanese-Korean F.; 126
JW        Javanese; Java, Malaysia, Surinam; Austric F., Malayo-Polynesian Br.; 64; JW from Bahasa Jawa
KA   *    Georgian; Georgia; Caucasian F.; 3; KA from Kartuli
KK   *    Kazakh; Kazakhstan, Sinkiang (China), Afghanistan; Uralo-Altaic F., Turkic Br.; 8
KL   *    Greenlandic; Greenland; Eskimo-Aleut F.; KL from Kalaallisut
KM        Cambodian; Cambodia, Thailand, Viet Nam; Austric F., Austro-Asiatic Br.; 9; KM from Khmer
KN        Kannada; Karnataka (India); Dravidian F.; 44
KO   *    Korean, Choson-o; South Korea, North Korea, NE China, Japan, Siberia, Hawaii (U.S.A.); Japanese-Korean F.; 75
KS        Kashmiri; Kashmir (India-Pakistan); Indo-European F., Indo-Iranian Br.; 4
KU        Kurdish, Zimany Kurdy; Turkey, Iran, Iraq, Syria; Indo-European F., Indo-Iranian Br.; 11
KY        Kirghiz; Kirghiz, Sinkiang (China), Afghanistan; Uralo-Altaic F., Turkic Br.; 2; KY from Kyrgyz
LA   *    Latin; Indo-European F., Italic Br.; Ancient language nearing extinction
LN        Lingala, liNgala; Zaire, Congo; Niger-Kordofanian F., Non-Mande Br.; 7
LO        Laothian, Pha Xa Lao, Lao; Laos, Thailand; Sino-Tibetan F., Sino-Siamese Br.; 4
LT   *    Lithuanian; Lithuania; Indo-European F., Balto-Slavic Br.; 3
LV   *    Latvian, Lettish; Latvia; Indo-European F., Balto-Slavic Br.; 2
MG        Malagasy; Madagascar; Austric F., Malayo-Polynesian Br.; 12
MI        Maori; New Zealand; Austric F., Malayo-Polynesian Br.
MK   *    Macedonian; Macedonia, Bulgaria, Greece; Indo-European F., Balto-Slavic Br.; 2
ML        Malayalam; Kerala (SW India); Dravidian F.; 35
MN        Mongolian; Mongolia; Uralo-Altaic F., Mongolic Br.
MO   *    Moldavian
MR        Marathi, Mahrati; Maharashtra (W India); Indo-European F., Indo-Iranian Br.; 69
MS        Malay; Malaysia, Indonesia; Austric F., Malayo-Polynesian Br.; 155; MS from Bahasa Malaysia
MT   *    Maltese; Malta; Hamito-Semitic F., Semitic Br.
MY        Burmese; Burma, Bangladesh; Sino-Tibetan F., Tibeto-Burmese Br.; 30; MY from Myanmasa
NA        Nauru, Nauruan; Nauru; Austric F., Malayo-Polynesian Br.
NE        Nepali, Nepalese; Nepal, Uttar Pradesh (India); Indo-European F., Indo-Iranian Br.; 16
NL   *    Dutch; Netherlands, Belgium; Indo-European F., Germanic Br.; 21; NL from Nederlands
NO   *    Norwegian; Norway; Indo-European F., Germanic Br.; 5
OC   *    Occitan; S France; Indo-European F., Italic Br.; 4
OM        (Afan) Oromo, Galla; Ethiopia, Kenya; Hamito-Semitic F., Cushitic Br.; 10
OR        Oriya; Orissa (E India); Indo-European F., Indo-Iranian Br.; 31
PA        Punjabi; Punjab (India), Pakistan; Indo-European F., Indo-Iranian Br.; 93; PA from Panjabi
PL   *    Polish; Poland, U.S.A.; Indo-European F., Balto-Slavic Br.; 44
PS        Pashto, Pushto, Pushtu; Afghanistan, Pakistan; Indo-European F., Indo-Iranian Br.; 21
PT   *    Portuguese; Brazil, Portugal, Spain, Uruguay, Argentina, Azores, Goa, Madeira; Indo-European F., Italic Br.; 182
QU        Quechua; Peru, Ecuador, Bolivia; Andean-Equatorial F., Andean Br.; 8
RM   *    Rhaeto-Romance, Rhaeto-Romanic, Romansch; S Switzerland, N Italy, Tyrol (Austria); Indo-European F., Italic Br.
RN        Kirundi, kiRundi; Niger-Kordofanian F., Non-Mande Br.
RO   *    Romanian, Rumanian; Rumania; Indo-European F., Italic Br.; 25
RU   *    Russian; Russia, former USSR republics; Indo-European F., Balto-Slavic Br.; 288
RW        Kinyarwanda, kinyaRuanda; Rwanda, Burundi, Uganda, Zaire, Tanzania; Niger-Kordofanian F., Non-Mande Br.; RW from Rwanda
SA        Sanskrit; India; Indo-European F., Indo-Iranian Br.; Ancient language
SD        Sindhi; Pakistan, Sind (India); Indo-European F., Indo-Iranian Br.; 18
SG        Sangho, Sango-Ngbandi; Central African Republic, Zaire; Niger-Kordofanian F., Non-Mande Br.; 4
SH   *    Serbo-Croatian; Croatia; Indo-European F., Balto-Slavic Br.; 20; See Note 2.
SI        Singhalese, Sinhalese; Sri Lanka; Indo-European F., Indo-Iranian Br.; 13
SK   *    Slovak; Slovakia; Indo-European F., Balto-Slavic Br.; 5
SL   *    Slovenian, Slovene; Slovenia, Italy, Austria; Indo-European F., Balto-Slavic Br.; 2
SM        Samoan; Samoa; Austric F., Malayo-Polynesian Br.
SN        Shona, chiShona; Rhodesia, Mozambique; Niger-Kordofanian F., Non-Mande Br.; 8
SO        Somali; Somalia, Ethiopia, Kenya; Hamito-Semitic F., Cushitic Br.; 5
SQ   *    Albanian; Albania, Kosovo (Yugoslavia), Italy, Greece; Indo-European F., Albanian Br.; 5; SQ from Shqip
SR   *    Serbian; Serbia; Indo-European F., Balto-Slavic Br.; SR from Srpski. See Note 2.
SS        Siswati, siSwati; South Africa, Rhodesia, Swaziland; Niger-Kordofanian F., Non-Mande Br.
ST        Sesotho, siSuthu; South Africa, Lesotho, Botswana; Niger-Kordofanian F., Non-Mande Br.
SU        Sundanese; West Java; Austric F., Malayo-Polynesian Br.; 26
SV   *    Swedish; Sweden, Finland; Indo-European F., Germanic Br.; 9; SV from Svenska
SW        Swahili, kiSwahili; Tanzania, Comoro Islands, Kenya, Mozambique, Zaire; Niger-Kordofanian F., Non-Mande Br.; 48
TA        Tamil; Tamil Nadu (S India), Sri Lanka, Malaysia, Singapore; Dravidian F.; 71
TE        Telugu, Telegu; Andhra Pradesh (India); Dravidian F.; 73
TG        Tajik, Tajiki; Tadzhikstan; Indo-European F., Indo-Iranian Br.; 5
TH   *    Thai; Thailand; 50
TI        Tigrinya; N Ethiopia; Hamito-Semitic F., Semitic Br.; 4
TK        Turkmen, Turkoman, Turcoman; Turkmenistan, Iran, Afghanistan; Uralo-Altaic F., Turkic Br.; 3
TL        Tagalog; Philippines; Austric F., Malayo-Polynesian Br.; 54
TN        Setswana; South Africa
TO        Tonga; Niger-Kordofanian F., Non-Mande Br.; 7
TR   *    Turkish; Turkey, Bulgaria, Yugoslavia, Cyprus, Greece; Uralo-Altaic F., Turkic Br.; 59
TS        Tsonga; 3
TT        Tatar, Tartar; Tatarstan; Uralo-Altaic F., Turkic Br.; 8
TW        Twi, Tshi; W Africa; Niger-Kordofanian F., Non-Mande Br.
UG        Uigur, Uighur, Uyghur; Sinkiang (China), Kazakhstan, Uzbekistan, Afghanistan; Uralo-Altaic F., Turkic Br.; 8; See Note 5.
UK   *    Ukrainian; Ukraine, Canada, U.S.A.; Indo-European F., Balto-Slavic Br.; 47
UR        Urdu; Pakistan, India; Indo-European F., Indo-Iranian Br.; 102; Same as Hindi [HI] except for writing system. See Note 3.
UZ        Uzbek, Uzbeg, Usbek, Usbeg; Uzbekstan, Tadzhikstan, Afghanistan; Uralo-Altaic F., Turkic Br.; 14
VI        Vietnamese; Viet Nam, Thailand, Cambodia, Laos, New Caledonia, France, Dakar; Sino-Tibetan F., Sino-Siamese Br.; 65
VO        Volapük; Artificial language
WO        Wolof; Senegal, Gambia; Niger-Kordofanian F., Non-Mande Br.; 7
XH        Xhosa, Xosa, isiXhosa; South Africa, Rhodesia, Swaziland; Niger-Kordofanian F., Non-Mande Br.; 8
YI   *    Yiddish; U.S.A., Israel, former USSR, Latin America, Canada, E Europe; Indo-European F., Germanic Br.; Formerly JI from Jiddisch. See Note 4.
YO        Yoruba; Western, Lagos and Kwara States (Nigeria), Benin; Niger-Kordofanian F., Non-Mande Br.; 20
ZA        Zhuang, Chwang, Chuang; China; 15; See Note 5.
ZH   *    Chinese; China; Sino-Tibetan F., Sino-Siamese Br.; 1,200; ZH from Zhongwen. See Note 1.
ZU        Zulu, isiZulu; South Africa, Rhodesia, Swaziland; Niger-Kordofanian F., Non-Mande Br.; 9