daniel added a comment.

  For the record, I don't remember what specifically went into the decision to 
rely on ISO-639-1. I have tried to gather some information on the topic. 
Here are some thoughts and observations:
  
  - Allowing both ISO-639-1 and ISO-639-3 would mean we'd end up with multiple 
identifiers for the same language. French could use either "fr" or "fra", and 
if we allowed ISO-639-2b as well, even "fre". That way, we may end up with 
multiple terms in the same language, each using a different code. We'd need to 
manage a list of aliases for this to work properly.
  - Using P9753 <https://phabricator.wikimedia.org/P9753> seems like a workable 
solution, though it kind of feels like cheating to me... since it refers to 
Wikidata itself.
  - The NewLexeme form is very confusing to me... how would I enter something 
like de-x-Q2031873 to represent "German with spelling per the 1901 conventions, 
before the 1996 reform"? It seems to me that the form doesn't allow variants to 
be entered for a known language. Rather, it asks for a language code if it 
doesn't know one for the language given. Perhaps the form field should just be 
called "language code", then?
  - The Wikidata data model does not specify which codes can be used in 
language tags. The conceptual model says //"a short string for identifying 
languages, based on the language preference setting of logged in Wikipedia 
users. (This might be more similar to BCP 47 but is not necessarily the same 
either; it is more fine-grained than a GlobalSiteIdentifier) "//.
  - If I had to design this again, I'd use just Q-Ids internally, and map to 
language codes when generating HTML, RDF, etc.
  - HTML5 and XML require the `lang` attribute to be BCP47/RFC5646. The specs 
say //"The lang attribute (in no namespace) specifies the primary language for 
the element's contents and for any of the element's attributes that contain 
text. Its value must be a valid BCP 47 language tag, or the empty string."// 
and //"The values of the attribute are language identifiers as defined by [IETF 
BCP 47], Tags for the Identification of Languages."// respectively.
  - RDF Turtle requires language tags to be BCP 47: //"Literals are composed of 
a lexical form and an optional language tag [BCP47] or datatype IRI."//.
  - As far as I can determine from browsing the spec, BCP 47 is a superset of 
ISO-639-1. It includes //many// codes from ISO-639-2 and ISO-639-3, but only if 
there wasn't an ISO-639-1 code for it, to avoid ambiguity (see section 2.2.1 
item 6).
  - P305 <https://phabricator.wikimedia.org/P305> is used in Wikidata to refer 
to BCP 47 language codes.
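
  To make the alias problem from the first point concrete, here is a minimal 
sketch of what "managing a list of aliases" could look like. The table and the 
`canonical_code` helper are hypothetical and only illustrate the idea; they do 
not describe anything Wikibase actually does.

```python
# Hypothetical alias table: every alternative code maps to one canonical code.
# Only a few illustrative entries are listed; a real table would be far larger.
ALIASES = {
    "fra": "fr",  # ISO 639-3 code for French
    "fre": "fr",  # ISO 639-2/B code for French
    "deu": "de",  # ISO 639-3 code for German
    "ger": "de",  # ISO 639-2/B code for German
}


def canonical_code(code: str) -> str:
    """Return the canonical code for a known alias, else the input normalized."""
    normalized = code.strip().lower()
    return ALIASES.get(normalized, normalized)
```

  With such a table, terms entered as "fr", "fra" or "fre" would all end up 
under "fr", at the cost of having to maintain the table.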
  
  Given all of the above, I would recommend the following to determine the 
language code for a given item: check P9753 
<https://phabricator.wikimedia.org/P9753> (explicit wikidata code), then fall 
back to P305 <https://phabricator.wikimedia.org/P305> (BCP 47), then fall back 
to P218 <https://phabricator.wikimedia.org/P218> (ISO-639-1). Do not use P220 
<https://phabricator.wikimedia.org/P220> (ISO-639-3), since that might 
introduce ambiguity. If all else fails, we could still use `mis-x-Pxxxx` to 
generate a language code for any item, but that may be problematic if a language 
code is later introduced for that item. All terms would have to be re-tagged.
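
  The lookup order recommended above could be sketched roughly as follows. This 
is an illustration only: the function name, the plain dict standing in for an 
item's statements, and building the last-resort tag from the item's own ID are 
all assumptions on my part, not existing Wikibase code.

```python
from typing import Mapping

# Properties to consult, in order of preference. P220 (ISO-639-3) is left out
# on purpose, because it could introduce ambiguity.
FALLBACK_PROPERTIES = ("P9753", "P305", "P218")


def language_code_for_item(item_id: str, claims: Mapping[str, str]) -> str:
    """Pick a language code for an item, given a property -> value mapping.

    `claims` is a simplified, hypothetical stand-in for the item's statements.
    """
    for prop in FALLBACK_PROPERTIES:
        value = claims.get(prop)
        if value:
            return value
    # Last resort: a private-use tag derived from the item ID (assumption).
    return f"mis-x-{item_id}"
```

  For example, an item that only has P218 set to "fr" and P220 set to "fra" 
would resolve to "fr", and an item with none of the three properties would get 
a `mis-x-...` tag, with the re-tagging caveat mentioned above.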

TASK DETAIL
  https://phabricator.wikimedia.org/T284882

To: daniel
Cc: daniel, Denny, Mahir256, Lucas_Werkmeister_WMDE, Lydia_Pintscher, 
Bugreporter, waldyrious, Nikki, Invadibot, maantietaja, Akuckartz, Nandana, 
Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, 
Bodhisattwa, Scott_WUaS, Wikidata-bugs, aude, Mbch331