On Tue, Apr 14, 2026 at 08:04:16PM +0100, Gavin Smith wrote:
> I'm still sceptical about other ways of specifying language variants using
> codes from BCP 47.
> 
> However, how do we know that the BCP 47 system promulgated by the
> IETF and/or IANA really classifies dialects of a language in the most
> appropriate way?

We do not know.  But the ISO is not infallible either.

> As far as I know, the IETF and IANA are separate
> organisations from the ISO.  If the "variant" codes were a straightforward
> extension of ISO 639, then it would be easier to accept, but it seems
> to be its own system run by different organisations.

I do not understand why you treat ISO, IANA and IETF differently (or
the Unicode Consortium, which also relies on IANA for its data).  All
are standards bodies; IETF and IANA are associated with the internet,
while ISO is more generic, but none is more trustworthy than another.

> It may not be the case that adopting BCP 47 achieves the aim of being able
> to designate all the languages that users want to use in their documents.

As far as I can tell, it is the best known way to do so.  I have looked
a bit at the projects that chose BCP 47; the ones that give an
explanation give this reason, labelled 'language neutrality' in some
cases (maybe reStructured?).  The argument for LibreOffice is the most
detailed one, and it clearly selects BCP 47 for this reason, inter alia:
https://wiki.documentfoundation.org/images/b/b5/LibreOffice_FOSDEM-2013_Language_Tags.pdf

> It seems that dialect or language classification is a hard and probably
> never-ending job.

It is.  But it is not a reason not to use the best current information
source on them.  Otherwise the variants will never be used, and the
languages that do not have a main variant will never be well supported.

> A finer-grained classification than the ISO 639 language codes may be harder
> to achieve successfully.  So the extent to which BCP 47 solves a problem
> depends on how well the IETF and/or IANA maintain the language codes.

Sure.  (Note: this is solely IANA's responsibility; the IETF puts the
IANA in charge of the BCP 47 subtag registry, but plays no part in
selecting the variants.)

> For example, the IANA subtag registry gives about 10 variants of Occitan,
> but not that many for most other languages.  Is this just because somebody
> wrote in asking for them to be added, or do they have some process of getting
> linguistic experts to check information?  Do they have people working
> all over the world studying dialects?

I do not know.  All I can say is that, to somebody who is not very
knowledgeable on the matter, neither a speaker nor a reader but with an
interest in Occitan, these look like good choices.  Of course, there is
no absolute certainty that the choices made are the best.  It is the
same for languages, actually: the language tags change over time.  But
even if the IANA work is imperfect, which I have no evidence of, I still
maintain that ignoring all the language variants is much worse than
being occasionally wrong for some variants.

> It seems that the problem of dialect classification is potentially
> very open to input from biased people pushing pet theories (which
> would especially be a problem for more obscure languages than Occitan
> - presumably somebody would pick up if somebody was trying to invent
> a non-existent dialect of Occitan, but this might not be the case for
> other languages).

Is that a real issue?  Worse than not supporting any of the language
variants?

> Suppose there turned out to be a flaw in the way that the IANA allocated
> sub-language codes.  Then we'd be stuck with referencing a broken system.

Not only us.  Everybody, as all the internet uses the IANA system,
through HTML, XML, LibreOffice, LaTeX, Wikipedia...  I do not believe
that we would be the most impacted: HTML, LibreOffice or Wikipedia have
a goal of handling all languages and have users and content in many
language variants.  We are definitely not at the forefront of
supporting the diversity of languages...  In any case, we could change
the data we use if we are dissatisfied with IANA.

> I imagine there well may be disputes as to the best way to study, document
> and classify the underlying linguistic reality, especially when it comes
> to minor linguistic variations.  It's possible there may be systems for
> classifying languages and dialects that may be better than what BCP
> 47 does, or that such systems may exist in the future.

Then we can switch.  All the evidence points towards the IANA selection
being the state of the art nowadays.  But we can switch whenever we
want.

> For example, if there was a real practical need for distinguishing
> variants of a language, maybe the ISO would invent new top-level codes
> for them.

Clearly not: there are many language variants in use, and the ISO did
not invent new top-level codes for them.  And why would they?  The IANA
does it, and I have not seen any evidence of discontent with the work
of IANA.

> There are ISO 639 codes for languages that could be considered
> dialects.  There is "yue" for Yue (Cantonese) even though there is already
> "zh" for Chinese.  There is "oc" for Occitan as well as codes for closely
> related Romance languages.  There is "sco" for Scots even though there
> is also "en" for English.

Indeed, ISO codes evolve.  Yet they definitely do not include many
variants.  When they do, the IANA data is updated, so I don't see the
issue here.  If the ISO were to include all the variants, we would get
them immediately by using the IANA data.

> If the proposal to reference the distinctions enabled in BCP 47 were
> driven by present practical needs, then we could better tell if using
> it were likely to be sufficient for those needs.  However, the inability to
> distinguish dialects of Occitan, or to distinguish Ekavian and Ijekavian
> pronunciations of Serbian, appear merely hypothetical problems.

For Occitan it is not a hypothetical issue, that's for sure.  What
is not known is whether there are people knowledgeable in Occitan
willing to write manuals and translated strings.  Outside of Texinfo,
variants are definitely used; see babel's list, for example, which
includes be-tarask, sr-ijekavsk, de-AT-1901, el-polyton.  (No Occitan
variant though, although I would have bet on it ;-).
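
To make the structure of such tags concrete, here is a minimal Python
sketch that splits a tag into language, script, region and variant
subtags.  It is only an illustration under stated assumptions: it
handles a simplified subset of the BCP 47 syntax (no extlang,
extensions, private-use or grandfathered tags), it does not check
subtags against the IANA registry, and the helper name split_tag is
hypothetical, not something taken from Texinfo or babel.

```python
import re

# Simplified subset of the BCP 47 (RFC 5646) "langtag" syntax:
# language, optional script, optional region, zero or more variants.
# Assumption: extlang, extensions, private use and grandfathered tags
# are ignored, and no subtag is validated against the IANA registry.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def split_tag(tag):
    """Split a language tag into its subtags (hypothetical helper)."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError("not in the supported subset: %r" % tag)
    script = m.group("script")
    region = m.group("region")
    return {
        "language": m.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        # The variants group looks like "-1901" or "-a-b"; drop the
        # empty string produced by the leading hyphen.
        "variants": [v.lower() for v in m.group("variants").split("-") if v],
    }


if __name__ == "__main__":
    for tag in ("be-tarask", "sr-ijekavsk", "de-AT-1901", "el-polyton"):
        print(tag, split_tag(tag))
```

In the four babel tags above, the language subtag is always an ISO 639
code, and the variant subtag is the part that only the IANA registry
defines.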

-- 
Pat
