> - What is the logic behind the idea of preloading some data in the
> format with luatex?

You mean as opposed to having them sit in a Lua file for packages to load on demand? I suppose the intent was that the formats would contain enough (meta)data to be self-descriptive - and we wanted to include hyphen.tex as \language0 anyway. You’d have to ask the three co-authors of luatex-hyphen (Khaled, Manuel, and Élie) directly, as I didn’t contribute much to that part of hyph-utf8, and I don’t think any of them reads this list.

> - Is there any convention as how hyphenation files should be named?
> Apparently most of them follow the pattern (load)hyph-LL (LL
> = lang iso code) and (load)hyph-LL-SSSS (SSSS = script iso code),
> but not all. (And of course, the encoding in the form .ec.)

This part has been implemented by Mojca and me. We follow BCP 47, which is, to our knowledge, the only standard that allows tagging languages and their variants to the level of precision that we need. It is defined by the IETF and consists of several of their RFCs (BCP stands for “Best Current Practice”), currently RFC 5646 and 4647; see https://tools.ietf.org/html/bcp47 for the full text.

A tag can have many elements; to sum up, any of the following can occur. Only the first element is mandatory, the rest are optional, and the order is normative:

* A language code (2-letter ISO 639-1, or, failing that, 3-letter ISO 639-3)
* A script code (4-letter ISO 15924)
* A country code (2-letter ISO 3166-1 or 3-digit UN M.49)
* Additional elements defined in the registry (5 to 8 letters or digits, or 4 characters starting with a digit, as in 1901)
* Private elements prefixed by -x-

The registry is maintained by IANA at http://www.iana.org/assignments/language-subtag-registry

This standard is very useful because, as mentioned, it allows great precision, but it also encourages not going into more detail than is needed, and we make every effort to follow that - unlike many software vendors that introduce a flurry of variants of Spanish, Portuguese, or French with little actual difference (tagged as es-ES, es-MX, es-AR, pt-PT, pt-BR, fr-FR, fr-BE, fr-CA, etc.). For each of these three languages we actually have only one set of patterns.

The language tags that we do actually use show examples of all the different tag elements above, for example:

* Many languages are identified by their 2-letter ISO 639-1 code alone; but some of them don’t have an ISO 639-1 code and we thus use the (3-letter) ISO 639-3 code: Friulan [fur], Ancient Greek [grc], Piedmontese [pms], and ... Mojca, where have the Ottoman Turkish patterns gone? Anyway.
Moving on:

* Script tags are used for languages of the Bosnian-Croatian-Serbian diasystem: sh-cyrl, sh-latn and sr-cyrl (see the thread starting at http://tug.org/pipermail/tex-hyphen/2011-July/000805.html for a discussion of the [sh] and [sr] parts)
* Country codes are used for English: en-gb and en-us
* Subtags from the registry are used for German and Greek: de-1901 and de-1996 (“old” and “new” spelling, first discussed in 1996 but only finalised in 2006), and el-monoton and el-polyton (sadly not “monotonic” and “polytonic” because of the 8-character limit)
* Finally, for some languages we had to make up a private tag; fortunately there are only two of them: la-x-classic for “Classical” Latin -- a bit of a misnomer as what it implements is the spelling of Latin where ‘v’ is not used (only ‘u’ is); apart from that it’s no closer to Classical Latin than the original set of patterns, so we could probably find a better name and tag. The other private tag is for “Mongolian LMC”, tagged mn-cyrl-x-lmc as a matter of pure convenience: these patterns were once the only set of patterns for Mongolian, and had been created by Oliver Corff for his specialist needs (typesetting an 18th-century pentaglot dictionary). When new patterns were produced by Mongolian users for use in current documents, it seemed an obvious choice to change to these (incidentally the only change we’ve ever made when unifying all patterns into hyph-utf8), while of course keeping the old patterns for Oliver to use. LMC was the name of the font encoding he devised for this purpose.

Does that answer your question?

Best,
Arthur
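
P.S. To make the tag structure above concrete, here is a rough Python sketch of how one of our tags splits into its elements. It is only an illustration, not how hyph-utf8 itself processes file names: parse_tag and tag_from_filename are names made up for this example, the regular expression covers only the kinds of elements listed above (no extlang, extension, or grandfathered tags), and the assumption that a file name is "hyph-" or "loadhyph-" followed by the tag and then a dot is taken from the pattern quoted in the question.

```python
import re

# Simplified BCP 47 pattern: only the subtag kinds described above
# (no extlang, extension, or grandfathered tags).
TAG_RE = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3})                    # ISO 639 language code
    (?:-(?P<script>[a-z]{4}))?                   # ISO 15924 script code
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?          # ISO 3166-1 or UN M.49
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # registry variants
    (?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?$     # private-use subtags
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_tag(tag):
    """Split a language tag into its elements (hypothetical helper)."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a tag this sketch understands: {tag!r}")

    def subtags(group):
        # Turn "-1901-foo" (or None) into ["1901", "foo"] (or []).
        return [s.lower() for s in (m.group(group) or "").split("-") if s]

    def single(group):
        value = m.group(group)
        return value.lower() if value else None

    return {
        "language": single("language"),
        "script": single("script"),
        "region": single("region"),
        "variants": subtags("variants"),
        "private": subtags("private"),
    }


def tag_from_filename(name):
    """Extract the tag from a (load)hyph-TAG file name (assumed layout)."""
    stem = name.split(".", 1)[0]  # drop ".tex", ".ec.tex", and the like
    for prefix in ("loadhyph-", "hyph-"):
        if stem.startswith(prefix):
            return stem[len(prefix):]
    raise ValueError(f"unexpected file name: {name!r}")
```

With this, parse_tag(tag_from_filename("loadhyph-mn-cyrl-x-lmc.tex")) yields language "mn", script "cyrl" and the private element "lmc", while "de-1901" and "el-monoton" come out with "1901" and "monoton" as registry variants.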
