[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread Bugreporter
Bugreporter added a subscriber: Nemo_bis.
Bugreporter added a comment.


  For fixing CLDR data, also ping @Nemo_bis.

TASK DETAIL
  https://phabricator.wikimedia.org/T341409

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Bugreporter
Cc: Nemo_bis, Michael, ItamarWMDE, Bugreporter, thiemowmde, 
Lucas_Werkmeister_WMDE, jhsoby, Amire80, Lydia_Pintscher, Manuel, 
mrephabricator, Nikki, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, 
Mahir256, QZanden, srishakatux, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  In T341409#9162199, @Nikki wrote:
  
  > In T341409#9148879, @Lucas_Werkmeister_WMDE wrote:
  >
  >>> - This would make another 230+ languages available, reducing the number 
of languages we have to dump under `mis` (related: T289776)
  >>
  >> And if T168799: Integrate IANA language registry with language-data and 
MediaWiki (let MediaWiki "knows" all languages with ISO 639-1/2/3 codes) 
happens, that would take us the rest of the way to T289776: Enable all ISO 
639-3 codes on Wikidata, right?
  >
  > I don't know. Does T289776 include labels or not?
  
  Hm, unclear. But it’s a good point that this task is not supposed to include 
labels, since a simple implementation of it (like I was playing around with 
earlier) would affect labels as well; I’ve added that to the task description.
  
  >> The additional cldr language codes are only added when asking for language 
names in a specific language, and the returned language codes vary slightly 
depending on which language you ask for:
  >> [...]
  >> (`de` and `bar` have additionally `en-uk`, with `bar` presumably 
inheriting it from `de` via language fallback; `pt`’s extra language code is 
`az-arab`.) I assume we always want to request the same language here, rather 
than make this depend on the user / request language; should it be the wiki 
content language (`en` on Wikidata), a hard-coded one (e.g. `en` or `qqq`), or 
something else?
  >
  > Hm, that doesn't sound good. Is that actually a bug in the CLDR extension? 
I would expect the set of language codes to be the same regardless of the 
language being used and that not being the case sounds like it would cause 
problems eventually. Perhaps it should have tests to make sure none of the 
files have extra codes that don't exist for English, or perhaps it should 
ignore any codes that aren't defined for all languages? Making the extension 
translatable would help here too, I imagine.
  
  Yeah, that should probably be fixed in the cldr extension. But I’ve convinced 
myself now that we should ask for the `en` language names, so as far as I’m 
concerned this variation is no longer a problem for Wikibase ^^
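  
  The kind of consistency test suggested in the quote could look roughly like 
the following. This is only a sketch in Python, not code from the cldr 
extension: the function name `extra_codes_vs_reference` and the sample data are 
made up for illustration, and the real name files are PHP arrays that would 
need to be loaded first.

```python
from typing import Dict, Set


def extra_codes_vs_reference(
    names_by_locale: Dict[str, Dict[str, str]],
    reference: str = "en",
) -> Dict[str, Set[str]]:
    """Return, per locale, the language codes that the reference locale lacks."""
    reference_codes = set(names_by_locale[reference])
    extras: Dict[str, Set[str]] = {}
    for locale, names in names_by_locale.items():
        diff = set(names) - reference_codes
        if diff:
            extras[locale] = diff
    return extras


# Invented sample data, shaped like the situation described in this thread:
# `de` carries `en-uk` and `pt` carries `az-arab`, neither of which is in `en`.
sample = {
    "en": {"de": "name-en-1", "en-gb": "name-en-2"},
    "de": {"de": "name-de-1", "en-gb": "name-de-2", "en-uk": "name-de-3"},
    "pt": {"de": "name-pt-1", "en-gb": "name-pt-2", "az-arab": "name-pt-3"},
}
extras = extra_codes_vs_reference(sample)
```

  A test in the extension could then fail whenever the returned mapping is not 
empty.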



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread Bugreporter
Bugreporter added a comment.


  > CLDR seems to be de facto used as a repository for names for language codes 
that happen to be used by people. Not at all as an authoritative source for 
language codes. Are we ok with using it anyway?
  
  The CLDR extension currently provides:
  
  - Some terms (including names of days of the week, months, and languages) in 
several languages, which are provided by the Common Locale Data Repository 
project of the Unicode Consortium
  - Additional English names for languages, defined in 
https://gerrit.wikimedia.org/g/mediawiki/extensions/cldr/%2B/HEAD/LocalNames/LocalNamesEn.php
  - Additional names of languages in some other languages 
(https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/cldr/+/HEAD/LocalNames), 
which may be severely incomplete (e.g. we have French and German but not 
Spanish) until T231755: Local language name should be translatable in 
translatewiki is fixed.
  
  Names.php together with $wmgExtraLanguageNames provides autonyms of 
languages.
  
  The language-data library provides more autonyms of languages (all languages 
currently in Names.php are in language-data, but not vice versa). Currently the 
main use of the library is the frontend language selector 
(UniversalLanguageSelector), but it has been proposed as a replacement for 
Names.php (T190129) and also for the CLDR extension (T281067).



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread Bugreporter
Bugreporter added a comment.


  > What is that CLDR mentioned in the CLDR extension itself?
  
  https://en.wikipedia.org/wiki/Common_Locale_Data_Repository



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread ItamarWMDE
ItamarWMDE updated the task description.



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread ItamarWMDE
ItamarWMDE added a comment.


  Thank you, I will add the AC you mentioned, but let others who are more 
experienced with CLDR try to clarify the ambiguities you found.



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-14 Thread Michael
Michael added a comment.


  I notice that I'm still a bit confused as to where CLDR is getting its 
languages from. Partly from core, partly from a manually maintained list 
(localNamesXX.php), but there are also comments like `# Added to Core, not part 
of CLDR, T287345`. What is that CLDR mentioned in the CLDR extension itself?
  
  In T341409#9148879, @Lucas_Werkmeister_WMDE wrote:
  
  > There is a slight ambiguity in the task description that I didn’t realize 
before. If we take it literally, and only pass LanguageNameUtils::ALL as the 
second getLanguageNames() argument while leaving the first argument the same 
(LanguageNameUtils::AUTONYMS, the default), then we won’t actually see any 
difference
  
  That would be due to
  
      if ( $inLanguage !== self::AUTONYMS ) {
          # TODO: also include for self::AUTONYMS, when this code is more efficient
          // @phan-suppress-next-line PhanTypeMismatchArgumentNullable False positive
          $this->hookRunner->onLanguageGetTranslatedLanguageNames( $names, $inLanguage );
      }
  
  in LanguageNameUtils.php. That means when requesting Autonyms, the extra 
languages from CLDR are not loaded.
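  
  To illustrate the effect, here is a simplified model of that branch. It is a 
Python sketch, not the actual MediaWiki code: `get_language_names`, 
`core_autonyms`, `hook_names` and the code `zzz` are hypothetical stand-ins, 
and the way core names fill in codes the hook did not name is an assumption.

```python
from typing import Dict, Optional

# Stand-in for LanguageNameUtils::AUTONYMS in this model.
AUTONYMS: Optional[str] = None


def get_language_names(
    in_language: Optional[str],
    core_autonyms: Dict[str, str],
    hook_names: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Model of the lookup: the hook only contributes for translated names."""
    names: Dict[str, str] = {}
    if in_language is not AUTONYMS:
        # Mirrors the quoted condition: hook handlers (e.g. cldr) run only
        # when names in a specific language are requested.
        names.update(hook_names.get(in_language, {}))
    # Assumption: core names fill in every code the hook did not provide.
    for code, autonym in core_autonyms.items():
        names.setdefault(code, autonym)
    return dict(sorted(names.items()))


core_autonyms = {"de": "Deutsch", "en": "English"}
# `zzz` is a made-up code that only the hook knows about.
hook_names = {"en": {"de": "German", "zzz": "Hypothetical extra language"}}

autonyms = get_language_names(AUTONYMS, core_autonyms, hook_names)
english = get_language_names("en", core_autonyms, hook_names)
```

  In this model `zzz` only shows up in `english`, never in `autonyms`, which is 
the "no difference" effect described in the quote.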
  
  There seems to be a mistake in the description. The languages in 
CldrNamesEn.php are the MediaWiki ones (that is what rebuild.php uses); the 
//additional// languages that we care about would seem to be the ones coming 
from LocalNamesEn.php and parallel files, right?
  
  In T341409#9162199, @Nikki wrote:
  
  > In T341409#9148879, @Lucas_Werkmeister_WMDE wrote:
  >
  >> The additional cldr language codes are only added when asking for language 
names in a specific language, and the returned language codes vary slightly 
depending on which language you ask for:
  >> [...]
  >> (`de` and `bar` have additionally `en-uk`, with `bar` presumably 
inheriting it from `de` via language fallback; `pt`’s extra language code is 
`az-arab`.) I assume we always want to request the same language here, rather 
than make this depend on the user / request language; should it be the wiki 
content language (`en` on Wikidata), a hard-coded one (e.g. `en` or `qqq`), or 
something else?
  >
  > Hm, that doesn't sound good. Is that actually a bug in the CLDR extension? 
I would expect the set of language codes to be the same regardless of the 
language being used and that not being the case sounds like it would cause 
problems eventually. Perhaps it should have tests to make sure none of the 
files have extra codes that don't exist for English, or perhaps it should 
ignore any codes that aren't defined for all languages? Making the extension 
translatable would help here too, I imagine.
  
  `en-uk` (together with `en-gb`) was added in "Add some German translation" 
(I0ce22dfc). CLDR seems to be de facto used as a repository for names for 
language codes that happen to be used by people. Not at all as an authoritative 
source for language codes. Are we ok with using it anyway?
  
  Also, I note that a lot of language names that have been added there seem to 
include a comment `# used by Wikidata, T123456`. So we may still want a process 
to add more, given that our current process is how we got to this list.
  
  **Further Async Storywriting notes:**
  
  Needs AC, aside from the one for actually doing the thing, also one or more 
for updating docs/policy/process which exists at least in the following places:
  
  - T312845: [Process] Add new language codes to Wikidata
  - https://phabricator.wikimedia.org/project/profile/4981/
  - https://www.mediawiki.org/wiki/Manual:Adding_and_removing_languages#Wikibase
  
  Also, should have an AC to go through the existing language related tasks and 
figure out which are still needed, maybe update them, and close the ones no 
longer needed after this one here is done.



[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-12 Thread Nikki
Nikki added a comment.


  In T341409#9018965, @thiemowmde wrote:
  
  > I might get this wrong. But as I understand the proposal it would make the 
currently established processes of how languages on wikidata.org are managed, 
requested, and confirmed (briefly described in T312845) obsolete.
  
  The documentation would need to be updated but it wouldn't make it completely 
obsolete. `LanguageNameUtils::ALL` is all of the language codes that MediaWiki 
knows about (≈ those it uses itself plus those which CLDR has locale data for), 
but that's still only a fraction of all valid ISO 639/BCP 47 language codes 
(languageinfo has 978, ISO 639-3 has 7916), so people would still need a way to 
request missing codes.
  
  Whether requests for things that are still missing should be accepted by 
Wikidata first or go straight to the CLDR extension depends on whether the 
people maintaining the CLDR extension are ok with people making requests for 
missing languages there.
  
  > the basic idea is that there is an "official" working group that 
intentionally reviews and accepts new languages one by one only when they are 
actually needed.
  
  That is still how it's intended to work and it still doesn't work well. The 
people who are being asked to review language codes one by one do not want to. 
People who request codes still have to wait months, if not years. Both @jhsoby 
and @amire80 have asked why we can't just enable all ISO 639-3 codes instead of 
enabling them one by one (or something to that effect), and that's what editors 
have asked for too (T289776).
  
  In T341409#9148879, @Lucas_Werkmeister_WMDE wrote:
  
  >> - This would make another 230+ languages available, reducing the number of 
languages we have to dump under `mis` (related: T289776)
  >
  > And if T168799: Integrate IANA language registry with language-data and 
MediaWiki (let MediaWiki "knows" all languages with ISO 639-1/2/3 codes) 
happens, that would take us the rest of the way to T289776: Enable all ISO 
639-3 codes on Wikidata, right?
  
  I don't know. Does T289776 include labels or not?
  
  I limited this request to monolingual text and lexemes because almost every 
valid language code would be useful in Wikidata for those (lexemes: any known 
word in the language; monolingual text: native label on the language itself, 
usage example on lexemes, etc). People are going to add that data whether the 
right code is available or not, so if MediaWiki already knows a language code 
exists, I think it makes sense to allow it.
  
  > From a technical side, I don’t see major issues with this proposal. But we 
might want to consolidate language name sources; currently, we have some 
`wikibase-lexeme-language-name-*` messages in WikibaseLexeme (but not used by 
Wikibase), and also some languages names in the cldr extension (`LocalNames/` 
directory). Maybe we can make Wikibase fall back to the language code and also 
track the missing language name, so we can have a Grafana board for the most 
frequently used language codes without names. But I think that doesn’t need to 
block this task.
  
  MediaWiki normally shows the language code if it can't find a name, so I 
don't think Wikibase would need to do anything special there, would it?
  
  If I'm not mistaken, it should already be possible to determine which ones 
are missing using wbcontentlanguages (although I recently added all the missing 
names so you'd need to test it locally).
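  
  A rough sketch of how such a check might work on an already fetched 
wbcontentlanguages response. It is written in Python and rests on assumptions: 
the response shape (entries keyed by code under `query.wbcontentlanguages`, 
each with an optional `name`) is assumed, and treating a name that is absent or 
equal to the code as "missing" is a guess based on the fallback behaviour 
mentioned above. The function name `codes_without_names` and the sample data 
are made up.

```python
from typing import Any, Dict, List


def codes_without_names(response: Dict[str, Any]) -> List[str]:
    """List codes whose name is absent, empty, or just the code itself."""
    languages = response.get("query", {}).get("wbcontentlanguages", {})
    missing = []
    for code, info in languages.items():
        name = info.get("name", "")
        if not name or name == code:
            missing.append(code)
    return sorted(missing)


# Invented response fragment in the assumed shape; `zzz` and `yyy` are
# placeholder codes, not real entries.
sample_response = {
    "query": {
        "wbcontentlanguages": {
            "en": {"code": "en", "name": "English"},
            "zzz": {"code": "zzz", "name": "zzz"},
            "yyy": {"code": "yyy"},
        }
    }
}
missing = codes_without_names(sample_response)
```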
  
  I would be happy to see the names consolidated, they're inconsistent at the 
moment (T322139). It's difficult to translate the names in the CLDR extension 
though, but perhaps it could be made translatable on translatewiki.net (like I 
suggested in this year's community wishlist).
  
  > The additional cldr language codes are only added when asking for language 
names in a specific language, and the returned language codes vary slightly 
depending on which language you ask for:
  > [...]
  > (`de` and `bar` have additionally `en-uk`, with `bar` presumably inheriting 
it from `de` via language fallback; `pt`’s extra language code is `az-arab`.) I 
assume we always want to request the same language here, rather than make this 
depend on the 

[Wikidata-bugs] [Maniphest] T341409: [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes

2023-09-12 Thread ItamarWMDE
ItamarWMDE renamed this task from "Use LanguageNameUtils::ALL for monolingual 
text and lexemes" to "[SW] Use LanguageNameUtils::ALL for monolingual text and 
lexemes".
