Re: [tex-hyphen] luatex and file names

Claudio Beccari Mon, 10 Aug 2015 09:43:05 -0700

As the maintainer of Latin (waiting for a true Latinist to take over) I can say that the variants, used as modifiers with babel and as options in polyglossia, are three: modern, medieval and classic. Modern and medieval hyphenation were accommodated in the same pattern file; it is not only a question of u and v, but also of the ligatures æ and œ; but, Arthur, you are right when you say that they are almost equal; the hyphenation is mostly phonetic in both variants.

The la-x-classic pattern set is very different: besides the u and v question, the hyphenation is mostly etymological; it is therefore very different from the phonetic one and much more difficult to create, since it is also necessary to take into account the case endings of declension and the diathesis endings of conjugation. The x-classic qualifier is not simply a tag with no or negligible contents behind its name, as you remark for the three French variants, the three Spanish variants, and the two Portuguese variants. Surprisingly enough, Italian variants such as it-IT and it-CH do not exist, although there exist different spelling dictionaries for these two variants (well, actually I am not surprised at all, because I also maintain the Italian patterns and I wouldn't do such a silly thing as to distinguish Italian-Italian from Swiss-Italian hyphenation).

Hans says there is no advantage in preloading the patterns into the format file. Maybe he is right for Lua(La)TeX, but one of the reasons why I make little use of LuaLaTeX is the "long" time luatex takes to load the patterns. Once they are loaded, its speed is almost the same as that of XeLaTeX; its performance is better as far as microtype is concerned; and its functionality in the interaction between the pdf engine and the Lua interpreter is exceptional. But if one does not need the latter functionality, it is not worth waiting that "long" time when one has to typeset a text that uses half a dozen languages and that, besides English, which is preloaded, must load the other five pattern files and create the suitable hash structures so as to use the other patterns efficiently. I might be completely wrong, but I consider this a glitch, not an advantage. For certain applications it is certainly an advantage because, for example, it is possible to modify or correct the patterns for special needs. But this is not so frequent.


Claudio

On 10/08/2015 17:32, Arthur Reutenauer wrote:

- What is the logic behind the idea of preloading some data in the
   format with luatex?

   You mean as opposed to having them sit in a Lua file for packages to
load on demand?  I suppose the intent was that the formats would contain
enough (meta)data to be self-descriptive - and we wanted to include
hyphen.tex as \language0 anyway.  You’d have to ask the three co-authors
of luatex-hyphen (Khaled, Manuel, and Élie) directly, as I didn’t
contribute to that part of hyph-utf8 much, and I don’t think any of them
reads this list.

- Is there any convention as to how hyphenation files should be named?
   Apparently most of them follow the pattern (load)hyph-LL (LL
   = lang iso code) and (load)hyph-LL-SSSS (SSSS = script iso code),
   but not all. (And of course, the encoding in the form .ec.)

   This part has been implemented by Mojca and me.  We follow BCP 47 that
is, to our knowledge, the only standard that allows us to tag languages and
their variants to the level of precision that we need.  It is defined by
the IETF and consists of several of their RFCs (BCP stands for “Best
Current Practice”), currently RFC 5646 and 4647; see 
https://tools.ietf.org/html/bcp47
for the full text.

   A tag can have many elements; to sum up, any of the following can occur -
only the first element is mandatory, the rest is optional, and the order
is normative:

   * A language code (2-letter ISO 639-1, or, failing that, 3-letter ISO 639-3)
   * A script code (4-letter ISO 15924)
   * A country code (2-letter ISO 3166-1 or 3-digit UN M.49)
   * Additional elements defined in the registry (5 to 8 letters or digits, or 4 starting with a digit)
   * Private elements prefixed by -x-
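
As a rough illustration of the element order listed above, here is a minimal Python sketch that splits a tag into its parts. It is an assumption-laden simplification, not the full RFC 5646 grammar: extended language subtags, extensions and grandfathered tags are ignored, tags are normalised to lower case, and the names `LanguageTag` and `parse_tag` are made up for this example. Variants of four characters starting with a digit are accepted because RFC 5646 permits them and tags such as de-1901 need them.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LanguageTag(NamedTuple):
    """Parts of a tag, in the normative order (hypothetical helper type)."""
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]
    private: Tuple[str, ...]


# Simplified subset of BCP 47: language, then optional script, region,
# registered variants and private-use subtags, in that order.
_TAG_RE = re.compile(
    r"""^
    (?P<language>[a-z]{2,3})                                  # ISO 639-1 / 639-3
    (?:-(?P<script>[a-z]{4}))?                                # ISO 15924
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                       # ISO 3166-1 / UN M.49
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)    # registered variants
    (?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?                   # private use
    $""",
    re.VERBOSE,
)


def parse_tag(tag: str) -> LanguageTag:
    """Split a tag into its elements, or raise ValueError if it does not fit."""
    match = _TAG_RE.match(tag.lower())
    if match is None:
        raise ValueError(f"not a tag in the simplified grammar: {tag!r}")
    variants = tuple(s for s in match.group("variants").split("-") if s)
    private = tuple(s for s in (match.group("private") or "").split("-") if s)
    return LanguageTag(
        match.group("language"),
        match.group("script"),
        match.group("region"),
        variants,
        private,
    )
```

Calling `parse_tag("mn-cyrl-x-lmc")` yields the language, the script and one private subtag, while a subtag longer than eight characters (for instance `el-monotonic`) is rejected with a `ValueError`.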

   The registry is maintained by IANA at 
http://www.iana.org/assignments/language-subtag-registry

   This standard is very useful because, as mentioned, it allows great
precision, but it also encourages us not to go into more detail than is
needed, and we make every effort to follow that - unlike many software
vendors that introduce a flurry of variants of Spanish, Portuguese, or
French with little actual differences (tagged as es-ES, es-MX, es-AR,
pt-PT, pt-BR, fr-FR, fr-BE, fr-CA, etc.).  For each of these three
languages we actually have only one set of patterns.

   The language tags that we do actually use show examples of all the
different tag elements above, as for example:

   * Many languages are identified by their 2-letter ISO 639-1 code
     alone; but some of them don’t have an ISO 639-1 code and we thus use
     the (3-letter) ISO 639-3 code: Friulan [fur], Ancient Greek [grc],
     Piedmontese [pms], and ... Mojca, where have the Ottoman Turkish
     patterns gone?  Anyway.  Moving on:

   * Script tags are used for languages of the Bosnian-Croatian-Serbian
     diasystem: sh-cyrl, sh-latn and sr-cyrl (see the thread
     starting at http://tug.org/pipermail/tex-hyphen/2011-July/000805.html
     for a discussion of the [sh] and [sr] parts)

   * Country codes are used for English: en-gb and en-us

   * Subtags from the registry are used for German and Greek:
     de-1901 and de-1996 (“old” and “new” spelling, first discussed in
     1996 but only finalised in 2006), and el-monoton and el-polyton
     (sadly not “monotonic” and “polytonic” because of the 8-character limit)

   * Finally, for some languages we had to make up a private tag;
     fortunately there are only two of them: la-x-classic for “Classical”
     Latin -- a bit of a misnomer as what it implements is the spelling
     of Latin where ‘v’ is not used (only ‘u’ is); apart from that it’s no
     closer to Classical Latin than the original set of patterns, so we
     could probably find a better name and tag.  The other private tag is
     for “Mongolian LMC”, tagged mn-cyrl-x-lmc as a matter of pure
     convenience: these patterns were once the only set of patterns for
     Mongolian, and had been created by Oliver Corff for his specialist
     needs (typesetting an 18th century pentaglot dictionary).  When new
     patterns were produced by Mongolian users for use in current
     documents, it seemed an obvious choice to change to these
     (incidentally the only change we’ve ever made when unifying all
     patterns into hyph-utf8), while of course keeping the old patterns
     for Oliver to use.  LMC was the name of the font encoding he devised
     for this purpose.
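
To connect the tags listed above with the (load)hyph-LL naming pattern quoted in the question, here is a small Python sketch. The function name `pattern_file_names` and the `.tex` extension are assumptions made for illustration; only the `hyph-` and `loadhyph-` prefixes and the `.ec` style encoding infix come from this thread.

```python
from typing import Dict


def pattern_file_names(tag: str, encoding: str = "") -> Dict[str, str]:
    """Build candidate file names for a language tag (hypothetical helper).

    The ".tex" extension is an assumption; the thread only mentions the
    "hyph-" / "loadhyph-" prefixes and an encoding infix such as ".ec".
    """
    base = f"hyph-{tag.lower()}"
    names = {
        "patterns": f"{base}.tex",     # the pattern file itself
        "loader": f"load{base}.tex",   # the file that loads the patterns
    }
    if encoding:
        # Variant of the patterns for a given font encoding
        names["converted"] = f"{base}.{encoding}.tex"
    return names
```

For example, `pattern_file_names("de-1996", "ec")` returns names built from `hyph-de-1996`, and a private-use tag such as la-x-classic goes through the same scheme unchanged.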

   Does that answer your question?

        Best,

                Arthur
