Re: allowing an @modifier for documentlanguage locale-based argument

Gavin Smith Mon, 13 Apr 2026 13:49:30 -0700

On Mon, Apr 13, 2026 at 06:06:52PM +0200, Patrice Dumas wrote:
> On Mon, Apr 13, 2026 at 07:21:50AM +0100, Gavin Smith wrote:
> > On Sun, Apr 12, 2026 at 11:19:44PM +0200, Patrice Dumas wrote:
> > > Maybe, but we can not say about all the languages in the IANA and we
> > > should rely on the best information existing, which is this information.
> > > For Occitan, which I know a bit about, the IANA is quite right, with the
> > > main variants as would be expected today (lengadoc in the place I live).
> > > There are written systems, lexics for computing related vocabulary.  The
> > > number of people speaking Occitan in general is decreasing, there are
> > > probably not that many persons interested by Occitan for computer
> > > related documentation and even less able to translate manuals and
> > > strings, but still, deciding that Occitan variants cannot be used as
> > > documentlanguage, while we have the information to do so seems wrong to
> > > me.
> > 
> > Could we make @documentlanguage take the argument LL[_CC][@VARIANT] where
> > VARIANT is the script, and add another command to specify language variant?
> > 
> > For example:
> > 
> > @documentlanguage oc
> > @documentlanguagevariant lengadoc
> > 
> > or
> > 
> > @documentlanguage sr@latin
> > @documentlanguagevariant ekavsk
> > 
> > That way the translations using .po files can be used as usual and the
> > "variant" can be used to mark the language in output formats where this
> > is supported.
> 
> This seems to me to be much less logical than the other way round.
> Indeed, @documentlanguagevariant really belongs to the @documentlanguage
> class of information, while the script information is more or less
> unrelated, except that it is used together to find the right strings to
> translate.
> 
> More generally, I think that we should consider that there are three
> different languages representations related to Texinfo.
> 
> 3) The third one is about how we get strings.
>    - in texi2any, this is from gettext, therefore we need to use the
>      XPG locales notation because it is what gettext understands currently.
>    - for TeX we can do whatever we want
> 
> 2) The second one is about how we represent the languages internally.
> I recently did a commit that changes texi2any to:
>  - store all the components of the locale language separately (main
>     language, region, and in the future, script and variants)
>  - use the BCP 47 language locales internally as hash keys or similar
> I made this choice because the BCP 47 language locales are uniquely and
> exhaustively specified using the BCP 47 specification + the IANA
> language-subtag-registry, in contrast with XPG locales.  And secondarily
> because in all the output formats BCP 47 is used.
> 
> 1) The first representation is the representation in the Texinfo language.
> To me this is the one that really matters, because this is the one we
> want to be right about and avoid having to change in the future, because
> we really dislike a lot changes to the Texinfo language.  For this
> representation, my preference would be to follow the conceptual model of
> BCP 47 because it is well defined with language, region, script and
> language variants in contrast with the XPG locales for which the script
> and language variants are there but are not well specified.  Also, BCP
> 47 + the IANA registry is the best source of existing languages, and
> although there are languages that are not relevant, no language is
> ignored, something that is much more important to me.  Also, the BCP 47
> information can map to XPG locales (for texi2any), while the reverse is
> less clear.
> 
> Regarding how those informations are provided, my preference is still
> 
> @documentlanguage lang_REGION_variant1_variant2...
> @documentscript script
> 
> An empty @documentscript is valid.  For @documentlanguage, a lang needs
> to be specified, the remaining is optional.
> 
> With script a ISO15924 4 letter code, and variant* found in 
> https://www.iana.org/assignments/language-subtag-registry
> 
> The presentation could be something else, but I can't imagine being
> convinced not to use the BCP 47 model + the IANA registry.


BCP 47 and other documentation is vast so it will take me some time to
get a grip on the situation.  Here are some notes.

I don't have a clear idea, but first I would say that the two-letter
argument to @documentlanguage would by far be the most common:

@documentlanguage pl

and so on.  As long as this continues undisturbed, the rest may not matter
that much.

Currently, the only extension that is necessary in Texinfo beyond this
is for Brazilian Portuguese, where the following is needed to specify
the language:

@documentlanguage pt_BR

Now, it is possible that some manuals use country suffixes unnecessarily.
I found a few manuals that used the following:

@documentlanguage en_US

So support for such country suffixes should continue in some form, although
it is a use of secondary importance.

It's possible we could also support such country suffixes with a more
general feature:

@documentlanguage en
@documentlanguagevariant US

Then maybe you could also do

@documentlanguage en
@documentlanguagevariant 826

to use the three-digit country codes (although this seems pointless to me).

The use of countries (with ISO codes) as names for variants is a coarse
simplification that, I expect, served the purpose of localising software
to different countries fairly well.  People localising software were not
generally interested in producing translations for many minor dialects of
a language and usually the people in one country could read the language
written in the standard version for that country.

gettext translation locale names, and POSIX locale names, also allow for
a "@variant" suffix, as in "sr@latin" to indicate Serbian written using
the Latin alphabet, but this is not currently used by Texinfo.  (This system
seems to me to be a kind of "bolt on" and possibly not worth using if there
is something better.)
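
To make the shape of these names concrete, here is a minimal Python sketch
that splits a name of the form LL[_CC][@VARIANT] into its parts.  The
function name parse_locale_name is hypothetical, and the pattern is an
assumption on my part: it deliberately ignores the optional codeset part
(such as ".UTF-8") that full POSIX locale names may carry, and it is not
how texi2any or gettext actually parse these names.

```python
import re

# Assumed simplified shape: LL[_CC][@VARIANT], with no codeset part.
LOCALE_NAME = re.compile(
    r"(?P<language>[a-z]{2,3})"
    r"(?:_(?P<region>[A-Z]{2}))?"
    r"(?:@(?P<variant>[A-Za-z0-9]+))?"
)

def parse_locale_name(name):
    """Split a locale name such as 'pt_BR' or 'sr@latin' into components.

    Components that are absent are returned as None.
    """
    match = LOCALE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not of the form LL[_CC][@VARIANT]: {name!r}")
    return match.groupdict()

print(parse_locale_name("pt_BR"))
print(parse_locale_name("sr@latin"))
```

With this, "pt_BR" yields a language and a region but no variant, while
"sr@latin" yields a language and a variant but no region.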

Now it's possible that the use of a country code as a suffix is insufficient
to specify a dialect or variant of interest.  It's not necessarily the case
that everyone speaks or writes the same language in one country.

It is possible to further classify variants of a language, as in the example
of Occitan, which, according to the IANA registry you referenced, has several
recognised regional variants.  It would not be possible to distinguish
these with country codes.

In the BCP 47 scheme (which is currently defined in RFC 5646 -
https://www.rfc-editor.org/rfc/rfc5646), a more elaborate format of
language specifier is used.  (BCP is short for "Best Current Practice"
which refers to a document series within the RFC series.)  As well as
the base language identifier (usually two letters, as in "pl"), it
allows for multiple language "sub-tags" to be suffixed to the language.

It's possible that BCP 47 is designed for other uses, such as recording
the language of entries in a library collection (so "bibliographic use").

I'm concerned about the complexity of BCP 47 as well as the lack of any
apparent demand to produce Texinfo documents written in languages that would
need to be specified using the language variants that BCP 47 allows.

Hence I'd like to consider if there could be a simpler approach.  Using BCP 47
language tags in the Texinfo language would make the BCP 47 documentation
part of the Texinfo language by reference.  It is not especially easy
to read and is written in a legalistic style, with great care about
compliance with other standards and backwards compatibility.  I wonder if
we could consider what distinctions BCP 47 allows.

BCP 47 language subtags are required to occur in a particular order if they
occur, although it's not required to have a subtag of each type:

From RFC 5646:

    There are different types of subtag, each of which is distinguished
    by length, position in the tag, and content: each subtag's type can
    be recognized solely by these features.

I don't have a completely clear view of this but it appears that the type
of subtag is mainly determined by its length, e.g. "variants" are between
5 and 8 letters.  Maybe if we had a way of providing such "subtag" information,
we wouldn't need to stick to the exact order of BCP 47.  Script could be
provided in a separate command.  Here's one idea:

@documentlanguage sr
@documentlanguagevariant ekavsk
@documentscript latin
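
To check how far length and content go in identifying subtag types, here
is a rough Python sketch.  It follows my reading of the RFC 5646 grammar:
a script is four letters, a region is two letters or three digits, and a
variant is five to eight alphanumeric characters, or four characters
starting with a digit.  The function name classify_subtags is made up,
and the sketch is a simplification: it treats the first subtag as the
language and labels anything else it does not recognise (extended
language subtags, extensions, private use) as "other".

```python
import re

def classify_subtags(tag):
    """Label the subtags of a simple BCP 47 tag by position, length, content."""
    subtags = tag.split("-")
    # Simplification: the first subtag is always taken to be the language.
    labelled = [("language", subtags[0])]
    for subtag in subtags[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", subtag):
            kind = "script"
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag):
            kind = "region"
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtag):
            kind = "variant"
        else:
            kind = "other"  # extlang, extension, private use: not handled
        labelled.append((kind, subtag))
    return labelled

print(classify_subtags("sr-Latn-RS-ekavsk"))
print(classify_subtags("oc-lengadoc"))
```

So a command that accepted an unordered list of such subtags could, at
least for these three types, still tell them apart without relying on
their position.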

I don't like the "latn" and "cyrl" abbreviations used in BCP 47 for "latin"
and "cyrillic" (I'm aware these abbreviated names come from incorporation
of another ISO standard) and think we should just stick to "latin" and
"cyrillic" as already used in .po files.
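
If the long script names were used in Texinfo, converting to the BCP 47
form for output formats would only need a small lookup table.  The
following Python sketch shows the idea; the function names and the table
are hypothetical, the table only covers the two scripts mentioned here,
and none of this reflects what texi2any currently does.  "Latn" and
"Cyrl" are the ISO 15924 codes, and the BCP 47 order is language, script,
region, then variants.

```python
# Hypothetical table from long script names to ISO 15924 codes.
SCRIPT_CODES = {"latin": "Latn", "cyrillic": "Cyrl"}

def bcp47_tag(language, region=None, script=None, variants=()):
    """Build a BCP 47 tag in the order language-script-region-variants."""
    parts = [language]
    if script:
        parts.append(SCRIPT_CODES[script])
    if region:
        parts.append(region)
    parts.extend(variants)
    return "-".join(parts)

def gettext_locale(language, region=None, script=None):
    """Build a locale name LL[_CC][@script] for looking up .po translations."""
    name = language
    if region:
        name += "_" + region
    if script:
        name += "@" + script
    return name

# Values as they might come from @documentlanguage,
# @documentlanguagevariant and @documentscript.
print(bcp47_tag("sr", script="latin", variants=["ekavsk"]))
print(gettext_locale("sr", script="latin"))
```

The variant is left out of the locale name on purpose, since the .po
files are named without it.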

That's all for tonight.
