As the maintainer of Latin (waiting for a true Latinist to take over), I can say that the variants, used as modifiers with babel and as options in polyglossia, are three: modern, medieval and classic. Modern and medieval hyphenation are accommodated in the same pattern file; it is not only a question of u and v, but also of the ligatures æ and œ. But, Arthur, you are right when you say that they are almost identical; the hyphenation is mostly phonetic in both variants.

The la-x-classic pattern set is very different: besides the u and v question, the hyphenation is mostly etymological. It is therefore very different from the phonetic one and much more difficult to create, since it is also necessary to take into account the case endings of the declensions and the voice (diathesis) endings of the conjugations. The x-classic qualifier is not simply a tag with little or nothing behind its name, as you remark for the three French variants, the three Spanish variants and the two Portuguese variants. Surprisingly enough, Italian variants such as it-IT and it-CH do not exist, although there are different spelling dictionaries for these two variants (well, actually I am not surprised at all, because I also maintain the Italian patterns and I wouldn't do such a silly thing as to distinguish Italian-Italian from Swiss-Italian hyphenation).

Hans says there is no advantage in preloading the patterns into the format file. Maybe he is right for Lua(La)TeX, but one of the reasons why I make little use of LuaLaTeX is the "long" time luatex takes to load the patterns. Once they are loaded, its speed is almost the same as that of XeLaTeX; its performance is better as far as microtype is concerned; and the interaction between the pdf engine and the Lua interpreter is exceptional. But if one does not need the latter functionality, it is not worth waiting that "long" time when one has to typeset a text that uses half a dozen languages: besides English, which is preloaded, luatex must load the other five pattern files and create the suitable hash structures in order to use those patterns efficiently. I might be completely wrong, but I consider this a glitch, not an advantage. For certain applications loading the patterns at run time is certainly an advantage because, for example, it makes it possible to modify or correct the patterns for special needs. But this is not so frequent.

Claudio

On 10/08/2015 17:32, Arthur Reutenauer wrote:
- What is the logic behind the idea of preloading some data in the
   format with luatex?
   You mean as opposed to having them sit in a Lua file for packages to
load on demand?  I suppose the intent was that the formats would contain
enough (meta)data to be self-descriptive - and we wanted to include
hyphen.tex as \language0 anyway.  You’d have to ask the three co-authors
of luatex-hyphen (Khaled, Manuel, and Élie) directly, as I didn’t
contribute to that part of hyph-utf8 much, and I don’t think any of them
reads this list.

- Is there any convention as how hyphenation files should be named?
   Apparently most of them follow the pattern (load)hyph-LL (LL
   = lang iso code) and (load)hyph-LL-SSSS (SSSS = script iso code),
   but not all. (And of course, the encoding in the form .ec.)
   This part has been implemented by Mojca and me.  We follow BCP 47, which
is, to our knowledge, the only standard that allows languages and their
variants to be tagged with the level of precision that we need.  It is defined by
the IETF and consists of several of their RFCs (BCP stands for “Best
Current Practice”), currently RFC 5646 and 4647; see 
https://tools.ietf.org/html/bcp47
for the full text.

   It can have many elements; to sum up, any of the following can occur -
only the first element is mandatory, the rest is optional, and the order
is normative:

   * A language code (2-letter ISO 639-1, or, failing that, 3-letter ISO 639-3)
   * A script code (4-letter ISO 15924)
   * A country code (2-letter ISO 3166-1 or 3-digit UN M.49)
   * Additional elements defined in the registry (5 to 8 letters or digits, or 4 characters starting with a digit)
   * Private elements prefixed by -x-
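
   The ordering above can be sketched as a small parser.  This is only an
illustration, not code from hyph-utf8: the names TAG_RE and parse_tag are
made up for the example, and the pattern deliberately ignores the parts of
BCP 47 not listed above (extended language subtags, extensions,
grandfathered tags).

```python
import re

# Simplified shape: language[-script][-region][-variant...][-x-private...]
# This is an assumption-laden subset of BCP 47, good enough for the tags
# discussed in this thread, not a full validator.
TAG_RE = re.compile(
    r"""^
    (?P<language>[a-z]{2,3})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?
    $""",
    re.VERBOSE | re.IGNORECASE,
)


def parse_tag(tag):
    """Split a tag into its elements; raise ValueError if the order is wrong."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a recognised tag: {tag!r}")
    return {
        "language": m.group("language"),
        "script": m.group("script"),    # None when absent
        "region": m.group("region"),    # None when absent
        "variants": [v for v in m.group("variants").split("-") if v],
        "private": [p for p in (m.group("private") or "").split("-") if p],
    }


for tag in ("grc", "sh-cyrl", "en-gb", "de-1996", "el-monoton", "mn-cyrl-x-lmc"):
    print(tag, parse_tag(tag))
```

   Because the order is normative, a tag such as "cyrl-sh" is rejected
rather than silently reordered.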

   The registry is maintained by IANA at 
http://www.iana.org/assignments/language-subtag-registry

   This standard is very useful because, as mentioned, it allows great
precision, but it also encourages users not to go into more detail than is
needed, and we make every effort to follow that - unlike many software
vendors that introduce a flurry of variants of Spanish, Portuguese, or
French with little actual difference (tagged as es-ES, es-MX, es-AR,
pt-PT, pt-BR, fr-FR, fr-BE, fr-CA, etc.).  For each of these three
languages we actually have only one set of patterns.
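
   One way a consumer could map an over-precise tag such as es-MX onto the
single Spanish pattern set is the "lookup" scheme of RFC 4647, which
truncates the tag from the right until something matches.  The sketch below
is only an illustration under that assumption: nothing here says that
hyph-utf8 or any package does this, and the PATTERN_SETS list and the
lookup function are hypothetical names with a hand-picked sample of tags.

```python
# Hypothetical sample of available pattern sets, for illustration only.
PATTERN_SETS = {
    "es", "pt", "fr", "en-gb", "en-us",
    "de-1901", "de-1996", "la", "la-x-classic",
}


def lookup(requested, available=PATTERN_SETS, default=None):
    """RFC 4647-style lookup: drop subtags from the right until one matches."""
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return candidate
        subtags.pop()
        # A single-letter subtag (such as "x") cannot end a tag, so it is
        # removed together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


for tag in ("es-MX", "pt-BR", "fr-CA", "la-x-classic", "en-AU"):
    print(tag, "->", lookup(tag))
```

   With this sample, es-MX, pt-BR and fr-CA all fall back to the bare
language, while en-AU finds nothing because only en-gb and en-us are listed;
a caller would have to decide on a default in that case.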

   The language tags that we actually use illustrate all the different
tag elements above:

   * Many languages are identified by their 2-letter ISO 639-1 code
     alone; but some of them don’t have an ISO 639-1 code and we thus use
     the (3-letter) ISO 639-3 code: Friulian [fur], Ancient Greek [grc],
     Piedmontese [pms], and ... Mojca, where have the Ottoman Turkish
     patterns gone?  Anyway.  Moving on:

   * Script tags are used for languages of the Bosnian-Croatian-Serbian
     diasystem: sh-cyrl, sh-latn and sr-cyrl (see the thread
     starting at http://tug.org/pipermail/tex-hyphen/2011-July/000805.html
     for a discussion of the [sh] and [sr] parts)

   * Country codes are used for English: en-gb and en-us

   * Subtags from the registry are used for German and Greek:
     de-1901 and de-1996 (“old” and “new” spelling, first discussed in
     1996 but only finalised in 2006), and el-monoton and el-polyton
     (sadly not “monotonic” and “polytonic” because of the 8-character limit)

   * Finally, for some languages we had to make up a private tag;
     fortunately there are only two of them: la-x-classic for “Classical”
     Latin -- a bit of a misnomer as what it implements is the spelling
     of Latin where ‘v’ is not used (only ‘u’ is); apart from that it’s no
     closer to Classical Latin than the original set of patterns, so we
     could probably find a better name and tag.  The other private tag is
     for “Mongolian LMC”, tagged mn-cyrl-x-lmc as a matter of pure
     convenience: these patterns were once the only set of patterns for
     Mongolian, and had been created by Oliver Corff for his specialist
     needs (typesetting an 18th century pentaglot dictionary).  When new
     patterns were produced by Mongolian users for use in current
     documents, it seemed an obvious choice to change to these
     (incidentally the only change we’ve ever made when unifying all
     patterns into hyph-utf8), while of course keeping the old patterns
     for Oliver to use.  LMC was the name of the font encoding he devised
     for this purpose.

   Does that answer your question?

        Best,

                Arthur
