daniel added a comment.
For the record, I don't remember what specifically went into the decision to rely on ISO-639-1. I have tried to gather some information on the topic. Here are some thoughts and observations:

- Allowing both ISO-639-1 and ISO-639-3 would mean we'd end up with multiple identifiers for the same language. French could use either "fr" or "fra", and if we allowed ISO-639-2b as well, even "fre". That way, we may end up with multiple terms in the same language, using different codes. We'd need to manage a list of aliases for this to work properly.
- Using P9753 <https://phabricator.wikimedia.org/P9753> seems like a workable solution, though it kind of feels like cheating to me... since it refers to Wikidata itself.
- The NewLexeme form is very confusing to me... how would I enter something like de-x-Q2031873 to represent "German with spelling per the 1901 conventions, before the 1996 reform"? It seems to me that the form doesn't allow variants to be entered for a known language. Rather, it asks for a language code only if it doesn't know one for the language given. Perhaps the form field should just be called "language code", then?
- The Wikidata data model does not specify which codes can be used in language tags. The conceptual model says //"a short string for identifying languages, based on the language preference setting of logged in Wikipedia users. (This might be more similar to BCP 47 but is not necessarily the same either; it is more fine-grained than a GlobalSiteIdentifier)"//.
- If I had to design this again, I'd use just Q-Ids internally, and map to language codes when generating HTML, RDF, etc.
- HTML5 and XML require the `lang` attribute to be BCP47/RFC5646. The specs say //"The lang attribute (in no namespace) specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language tag, or the empty string."// and //"The values of the attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages."// respectively.
- RDF Turtle requires language tags to be BCP 47: //"Literals are composed of a lexical form and an optional language tag [BCP47] or datatype IRI."//.
- As far as I can determine from browsing the spec, BCP 47 is a superset of ISO-639-1. It includes //many// codes from ISO-639-2 and ISO-639-3, but only where there is no ISO-639-1 code for the language, to avoid ambiguity (see section 2.2.1 item 6).
- P305 <https://phabricator.wikimedia.org/P305> is used in Wikidata to refer to BCP 47 language codes.

Given all of the above, I would recommend the following to determine the language code for a given item: check P9753 <https://phabricator.wikimedia.org/P9753> (explicit Wikidata code), then fall back to P305 <https://phabricator.wikimedia.org/P305> (BCP 47), then fall back to P218 <https://phabricator.wikimedia.org/P218> (ISO-639-1). Do not use P220 <https://phabricator.wikimedia.org/P220> (ISO-639-3), since that might introduce ambiguity.

If all else fails, we could still use `mis-x-Pxxxx` to generate a language code for any item, but that may be problematic if a language code is later introduced for that item: all terms would have to be re-tagged.

TASK DETAIL
  https://phabricator.wikimedia.org/T284882

To: daniel
Cc: daniel, Denny, Mahir256, Lucas_Werkmeister_WMDE, Lydia_Pintscher, Bugreporter, waldyrious, Nikki, Invadibot, maantietaja, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Bodhisattwa, Scott_WUaS, Wikidata-bugs, aude, Mbch331
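
For illustration only, the fallback order recommended in the comment (P9753, then P305, then P218, never P220, with a private-use `mis-x-…` tag as the last resort) could be sketched roughly as follows. This is not Wikibase code: the function name `language_code`, the shape of the `claims` mapping (property ID to a list of string values), and the choice to append the item's own ID to `mis-x-` are all assumptions made for the example.

```python
from typing import Mapping, Sequence

# Property IDs named in the comment, in the recommended lookup order.
# P220 (ISO-639-3) is deliberately absent, since it might introduce ambiguity.
FALLBACK_ORDER = ("P9753", "P305", "P218")


def language_code(item_id: str, claims: Mapping[str, Sequence[str]]) -> str:
    """Pick a language code for an item using the recommended fallback order.

    `claims` is a hypothetical, simplified view of the item's statements:
    property ID -> list of string values.
    """
    for prop in FALLBACK_ORDER:
        values = claims.get(prop, ())
        if values:
            # Take the first value; handling of ranks or multiple values
            # is out of scope for this sketch.
            return values[0]
    # Last resort: a private-use tag derived from the item ID (assumption).
    return f"mis-x-{item_id}"


if __name__ == "__main__":
    # P218 is used because neither P9753 nor P305 is present; P220 is ignored.
    print(language_code("Q150", {"P218": ["fr"], "P220": ["fra"]}))
    # Only P220 is present, so the private-use fallback is generated.
    print(language_code("Q1", {"P220": ["xyz"]}))
```

As the comment notes, a code produced by the last-resort branch becomes stale if a proper language code is later introduced for the item, so any terms tagged with it would need to be re-tagged.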