Small update: I went through the language list at

https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472

and added a number of TODOs to the most obvious problematic cases. Typical problems are:

* Malformed language codes ('tokipona')
* Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
* Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
* Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
* Use of macrolanguages instead of languages (e.g., "zh" is not "Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about Kurdish ...)
* Language codes with incomplete information (e.g., "sr" should be "sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and "zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or traditional?]).

I invite any language experts to look at the file and add comments/improvements. Some of the issues should possibly also be considered on the implementation side: we don't want two distinct codes for the same thing.

Cheers,

Markus


On 04/08/13 16:35, Markus Krötzsch wrote:
On 04/08/13 13:17, Federico Leva (Nemo) wrote:
Markus Krötzsch, 04/08/2013 12:32:
* Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be a
mapping somewhere. Where?

Where I linked it.

Are you sure? The file you linked has mappings from site ids to language
codes, not from language codes to language codes. Do you mean to say:
"If you take only the entries of the form 'XXXwiki' in the list, and
extract a language code from the XXX, then you get a mapping from
language codes to language codes that covers all exceptions in
Wikidata"? This approach would give us:

'als' : 'gsw',
'bat-smg': 'sgs',
'be_x_old' : 'be-tarask',
'crh': 'crh-latn',
'fiu_vro': 'vro',
'no' : 'nb',
'roa-rup': 'rup',
'zh-classical' : 'lzh',
'zh-min-nan': 'nan',
'zh-yue': 'yue'

Each of the values on the left here also occurs as a language tag in
Wikidata, so if we map them, we use the same tag for things that
Wikidata has distinct tags for. For example, Q27 has a label for yue but
also for zh-yue [1]. It seems to be wrong to export both of these with
the same language tag if Wikidata uses them for different purposes.

Maybe this is a bug in Wikidata and we should just not export texts with
any of the above codes at all (since they always are given by another
tag directly)?
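
To make the collision problem concrete, here is a minimal Python sketch. It takes the exception list above as given, derives a tag from a site id of the form 'XXXwiki', and reports label tags that would collapse onto the same exported tag. The names EXCEPTIONS, map_tag, tag_for_site and colliding_tags are made up for this illustration and are not taken from epTurtleFileWriter.py. Treating '_' and '-' as interchangeable is also an assumption.

```python
# Exception list as quoted above: site-id prefix -> preferred language tag.
EXCEPTIONS = {
    'als': 'gsw',
    'bat-smg': 'sgs',
    'be_x_old': 'be-tarask',
    'crh': 'crh-latn',
    'fiu_vro': 'vro',
    'no': 'nb',
    'roa-rup': 'rup',
    'zh-classical': 'lzh',
    'zh-min-nan': 'nan',
    'zh-yue': 'yue',
}

# Assumption: underscores and hyphens are interchangeable in these codes,
# so all keys are normalized to the hyphenated form.
_NORMALIZED = {k.replace('_', '-'): v for k, v in EXCEPTIONS.items()}


def map_tag(tag):
    """Return the preferred tag for a code, or the code itself if unmapped."""
    tag = tag.replace('_', '-')
    return _NORMALIZED.get(tag, tag)


def tag_for_site(site_id):
    """Derive a language tag from a site id of the form 'XXXwiki'."""
    if not site_id.endswith('wiki') or site_id == 'wiki':
        return None
    return map_tag(site_id[:-len('wiki')])


def colliding_tags(tags):
    """Group label tags by exported tag; keep groups with several sources."""
    by_target = {}
    for tag in tags:
        by_target.setdefault(map_tag(tag), set()).add(tag)
    return {t: sorted(s) for t, s in by_target.items() if len(s) > 1}
```

Calling colliding_tags on the label tags of a single item shows which labels would end up under one exported tag, which is the situation described for Q27. Note that tag_for_site does not know which prefixes are languages at all, so non-language wikis would need separate handling.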


* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes
provides some mappings. For example, it maps "zh-yue" to "yue". Yet,
Wikidata uses both of these codes. What does this mean?

Answers to Nemo's points inline:

On 04/08/13 06:15, Federico Leva (Nemo) wrote:
Markus Krötzsch, 03/08/2013 15:48:

...

Apart from the above, doesn't wgLanguageCode in
https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php

have what you need?

Interesting. However, the list there does not contain all 300 sites that
we currently find in Wikidata dumps (and some that we do not find there,
including things like dkwiki that seem to be outdated). The full list of
sites we support is also found in the file I mentioned above, just after
the language list (variable siteLanguageCodes).

Of course not all wikis are there, that configuration is needed only
when the subdomain is "wrong". It's still not clear to me what codes you
are considering wrong.

Well, the obvious: if a language used in Wikidata labels or on Wikimedia
sites has an official IANA code [2], then we should use this code. Every
other code would be "wrong". For languages that do not have any accurate
code, we should probably use a private code, following the requirements
of BCP 47 for private use subtags (in particular, they should have a
single x somewhere).
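
As a rough illustration of the private-use requirement, the following Python sketch extracts the private-use part of a tag: a singleton 'x' subtag followed by at least one subtag of 1 to 8 alphanumeric characters. This is only a partial syntax check and not a full BCP 47 validator. The function name private_use_part is invented for this example.

```python
import re

# A private-use subtag is 1 to 8 ASCII letters or digits.
_SUBTAG = re.compile(r'^[A-Za-z0-9]{1,8}$')


def private_use_part(tag):
    """Return the subtags after the first 'x' singleton, or None.

    None means the tag has no private-use section, or the section
    is empty or contains a subtag that is not 1-8 alphanumerics.
    """
    subtags = tag.split('-')
    lowered = [s.lower() for s in subtags]
    if 'x' not in lowered:
        return None
    rest = subtags[lowered.index('x') + 1:]
    if not rest or not all(_SUBTAG.match(s) for s in rest):
        return None
    return rest
```

With this check, 'be-x-old' has a private-use part (the subtag 'old'), while 'map-bms' has none, so the latter would have to be justified by registered subtags alone.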

This does not seem to be done correctly by my current code. For example,
we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are
IANA language tags, I am not sure that their combination makes sense.
The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and
it is a language code, not a dialect code). Note that map-bms does not
occur in the file you linked to, so I guess there is some more work to do.
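
One way to check cases like 'map-bms' mechanically is to look up each subtag in the IANA registry [2]. The registry is a plain-text file of records separated by '%%' lines, with fields such as Type, Subtag and Description. The Python sketch below parses that format from a string and reports the registered types of each subtag. It does not download anything and does not handle subtag ranges such as 'qaa..qtz'. All function names are invented for this illustration.

```python
def parse_registry(text):
    """Parse registry text into a list of records (field -> list of values)."""
    records, record, last = [], {}, None
    for line in text.splitlines():
        if line.strip() == '%%':
            # Record separator: store the finished record, start a new one.
            if record:
                records.append(record)
            record, last = {}, None
        elif not line.strip():
            continue
        elif line[0] in ' \t' and last is not None:
            # Folded line: continues the value of the previous field.
            record[last][-1] += ' ' + line.strip()
        else:
            name, sep, value = line.partition(':')
            if sep:
                last = name.strip()
                record.setdefault(last, []).append(value.strip())
    if record:
        records.append(record)
    return records


def subtag_types(records):
    """Index records by lower-cased subtag -> set of registered types."""
    index = {}
    for rec in records:
        for subtag in rec.get('Subtag', []):
            index.setdefault(subtag.lower(), set()).update(rec.get('Type', []))
    return index


def describe_tag(tag, index):
    """Return (subtag, sorted registered types) for each subtag of a tag."""
    return [(s, sorted(index.get(s.lower(), ()))) for s in tag.split('-')]
```

If both subtags of 'map-bms' come back with type 'language' only, that would match the concern above: the second position of a tag is not where a primary language subtag belongs.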

Markus

[1] http://www.wikidata.org/wiki/Special:Export/Q27
[2] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry





_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
