Hi Markus,
Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl'; likewise,
our code 'sr-el' is currently effectively equivalent to 'sr-Latn'. Both might change
once dialect codes for Serbian are added to the IANA subtag registry at
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Our code 'nrm' is not used for the Narom language, as it is in ISO 639-3; see:
http://www-01.sil.org/iso639-3/documentation.asp?id=nrm
We use it instead for Norman / Nourmaud, as described at
http://en.wikipedia.org/wiki/Norman_language
The Norman language is recognized by the Linguist List and many others, but it is
not yet present in ISO 639-3. It should probably be proposed for addition there.
Meanwhile, we should probably map it to a private-use code.
Our code 'ksh' is currently used to represent a superset of what it stands for
in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the code of the
only Ripuarian variety (of dozens) that has one to represent the whole lot. We
should probably propose adding a group code to ISO 639, along with codes for at
least the dozen-plus Ripuarian languages that we are using, and map 'ksh' to a
private-use code for Ripuarian meanwhile.
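Until ISO 639 catches up, both 'nrm' and 'ksh' could be exported the same way. A minimal sketch of mapping such codes to BCP 47 private-use tags (the table entries and the 'x-rip' tag are hypothetical illustrations, not an agreed convention):

```python
# BCP 47 reserves every subtag after a single 'x' for private use, so tags
# such as 'x-nrm' can never collide with the IANA subtag registry.
# The mapping below is illustrative only; these private tags are
# hypothetical, not an agreed Wikidata convention.
PRIVATE_USE = {
    'nrm': 'x-nrm',  # Norman; plain 'nrm' means Narom in ISO 639-3
    'ksh': 'x-rip',  # Ripuarian as a group; 'ksh' proper is one variety
}

def export_tag(internal_code: str) -> str:
    """Return a BCP 47-safe tag for a Wikimedia-internal language code."""
    return PRIVATE_USE.get(internal_code, internal_code)
```

Labels exported under such tags stay well-formed BCP 47, and the mapping can simply be dropped once real subtags are registered.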
Note also that, for the ALS/GSW and KSH Wikipedias, page titles are not
guaranteed to be in the languages of those Wikipedias; they are often in German
instead. Details can be found in their respective page-titling rules. Moreover,
on the ksh Wikipedia, unlike some other multilingual or multidialectal Wikipedias,
texts are either not labelled as belonging to a certain dialect, or quite often
labelled incorrectly.
See also: http://meta.wikimedia.org/wiki/Special_language_codes
Greetings -- Purodha
Sent: Sunday, 4 August 2013 at 19:01
From: "Markus Krötzsch" <mar...@semantic-mediawiki.org>
To: "Federico Leva (Nemo)" <nemow...@gmail.com>
Cc: "Discussion list for the Wikidata project." <wikidata-l@lists.wikimedia.org>
Subject: [Wikidata-l] Wikidata language codes (Was: Wikidata RDF export available)
Small update: I went through the language list at
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472
and added a number of TODOs to the most obvious problematic cases.
Typical problems are:
* Malformed language codes ('tokipona')
* Correctly formed language codes without any official meaning (e.g.,
'cbk-zam')
* Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian
from Ecuador?!)
* Language codes with redundant information (e.g., 'kk-cyrl' should be
the same as 'kk' according to IANA, but we have both)
* Use of macrolanguages instead of languages (e.g., "zh" is not
"Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about
Kurdish ...)
* Language codes with incomplete information (e.g., "sr" should be
"sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and
"zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or
traditional?]).
I invite any language experts to look at the file and add
comments/improvements. Some of the issues should possibly also be
considered on the implementation side: we don't want two distinct codes
for the same thing.
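Most of the problem classes above can be caught mechanically before export. A minimal sketch of a well-formedness check, using my own simplified version of the BCP 47 langtag grammar (not part of the wda code; it deliberately omits extlang and extension subtags, so grandfathered tags like 'zh-min-nan' are rejected):

```python
import re

# Simplified BCP 47 well-formedness check: structure only, NOT validity
# against the IANA registry ('sr-ec' is well-formed but still wrong).
LANGTAG = re.compile(
    r'^(?:'
    r'[a-z]{2,3}'                                   # primary language subtag
    r'(?:-[A-Za-z]{4})?'                            # script, e.g. Cyrl, Hant
    r'(?:-(?:[A-Za-z]{2}|\d{3}))?'                  # region, e.g. HK, 419
    r'(?:-(?:[A-Za-z0-9]{5,8}|\d[A-Za-z0-9]{3}))*'  # variants, e.g. tarask
    r'(?:-x(?:-[A-Za-z0-9]{1,8})+)?'                # private-use extension
    r'|x(?:-[A-Za-z0-9]{1,8})+'                     # private-use only tag
    r')$'
)

def well_formed(tag: str) -> bool:
    """True if tag matches the simplified BCP 47 langtag grammar."""
    return LANGTAG.match(tag) is not None
```

A check like this flags 'tokipona' immediately, but catching codes such as 'sr-ec' or 'kk-Cyrl' additionally requires validating each subtag against the IANA registry file linked above.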
Cheers,
Markus
On 04/08/13 16:35, Markus Krötzsch wrote:
> On 04/08/13 13:17, Federico Leva (Nemo) wrote:
>> Markus Krötzsch, 04/08/2013 12:32:
>>> * Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
>>> language seem to use "be-tarask" as a language code. So there must be a
>>> mapping somewhere. Where?
>>
>> Where I linked it.
>
> Are you sure? The file you linked has mappings from site ids to language
> codes, not from language codes to language codes. Do you mean to say:
> "If you take only the entries of the form 'XXXwiki' in the list, and
> extract a language code from the XXX, then you get a mapping from
> language codes to language codes that covers all exceptions in
> Wikidata"? This approach would give us:
>
> 'als' : 'gsw',
> 'bat-smg': 'sgs',
> 'be_x_old' : 'be-tarask',
> 'crh': 'crh-latn',
> 'fiu_vro': 'vro',
> 'no' : 'nb',
> 'roa-rup': 'rup',
> 'zh-classical' : 'lzh',
> 'zh-min-nan': 'nan',
> 'zh-yue': 'yue'
>
> Each of the keys on the left here also occurs as a language tag in
> Wikidata, so if we map them, we use the same tag for things that
> Wikidata has distinct tags for. For example, Q27 has a label for yue but
> also for zh-yue [1]. It seems to be wrong to export both of these with
> the same language tag if Wikidata uses them for different purposes.
>
> Maybe this is a bug in Wikidata and we should just not export texts with
> any of the above codes at all (since they always are given by another
> tag directly)?
>
>>
>>> * MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes
>>> provides some mappings. For example, it maps "zh-yue" to "yue". Yet,
>>> Wikidata uses both of these codes. What does this mean?
>>>
>>> Answers to Nemo's points inline:
>>>
>>> On 04/08/13 06:15, Federico Leva (Nemo) wrote:
>>>> Markus Krötzsch, 03/08/2013 15:48:
>
> ...
>
>>>> Apart from the above, doesn't wgLanguageCode in
>>>> https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
>>>>
>>>> have what you need?
>>>
>>> Interesting. However, the list there does not contain all 300 sites that
>>> we currently find in Wikidata dumps (and some that we do not find there,
>>> including things like dkwiki that seem to be outdated). The full list of
>>> sites we support is also found in the file I mentioned above, just after
>>> the language list (variable siteLanguageCodes).
>>
>> Of course not all wikis are there, that configuration is needed only
>> when the subdomain is "wrong". It's still not clear to me what codes you
>> are considering wrong.
>
> Well, the obvious: if a language used in Wikidata labels or on Wikimedia
> sites has an official IANA code [2], then we should use this code. Every
> other code would be "wrong". For languages that do not have any accurate
> code, we should probably use a private code, following the requirements
> of BCP 47 for private use subtags (in particular, they should have a
> single x somewhere).
>
> This does not seem to be done correctly by my current code. For example,
> we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are
> IANA language tags, I am not sure that their combination makes sense.
> The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and
> it is a language code, not a dialect code). Note that map-bms does not
> occur in the file you linked to, so I guess there is some more work to do.
>
> Markus
>
> [1] http://www.wikidata.org/wiki/Special:Export/Q27
> [2]
> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>
>
>
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l