[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

C933103 Wed, 20 Jul 2022 20:28:00 -0700

C933103 added a comment.


  In T236593#8092471 <https://phabricator.wikimedia.org/T236593#8092471>, 
@LucasWerkmeister wrote:
  
  > It’s still not clear to me which problem the `-x-Q123-1` patch is trying to 
solve. Several languages have been mentioned in this task, but which of them 
would benefit from this system? I feel like for several of them, we’ve already 
reached the conclusion that separate forms are in fact the way to go.
  >
  > I’d like to extract a general rule from @Fnielsen’s comment above 
(T236593#5610903 <https://phabricator.wikimedia.org/T236593#5610903>): if you 
need separate statements, then you need separate forms or lexemes. (I think 
this is a sufficient condition, though it might not be a necessary one.) 
Pronunciation (whether pronunciation audio 
<https://www.wikidata.org/wiki/Property:P443> or IPA transcription 
<https://www.wikidata.org/wiki/Property:P898>) is probably the most significant 
kind of statement here: if a speaker would pronounce the spellings differently, 
then they should be different forms – regardless of whether the difference is a 
completely different ending as in octopuses/octopi, or just an extra schwa as 
in aft(e)nen. (I don’t find the hyphenation example as convincing… don’t you 
need a different hyphenation for every spelling variant, even for cases that 
really should just be multiple representations of one form? E.g. co‧lor/co‧lour 
– that could just be multiple statements on the same form, with different 
monolingual text language codes.)
  >
  > I suspect this rule covers the Norwegian example that originally motivated 
this task: I feel like “parametere” and “parametre” are probably pronounced 
differently, much like “aftnen” and “aftenen” are pronounced differently in 
Danish according to Finn. For Vietnamese chữ Nôm, I feel like @mxn’s comment at 
T236593#8024999 <https://phabricator.wikimedia.org/T236593#8024999> goes in a 
similar direction, though I admit I find the whole Chinese-characters part of 
this discussion hard to follow.
  >
  > For the cases where you really only want to have one form with multiple 
representations, I still agree with @daniel’s comment (T236593#5610378 
<https://phabricator.wikimedia.org/T236593#5610378>): “you make up a code for 
each of the spellings”. In practice, the only way to “make up a code” that we 
currently support is to append `-x-Q12345` to an existing, established 
language code. As far as I understand, this solution works well for Hebrew: 
e.g. ספר/סֵפֶר (L67105) <https://www.wikidata.org/wiki/Lexeme:L67105> (the 
“book” word) uses the language codes `he` and `he-x-Q21283070`, where Q21283070 
<https://www.wikidata.org/wiki/Q21283070> represents Tiberian vocalization, the 
orthography with diacritics. At some point, an editorial decision was made that 
the spelling without diacritics “deserves” the unsuffixed `he` language code 
(instead of both spellings using an `-x-Q12345` language code), which I think 
is reasonable: data reusers who don’t care about the different spellings can 
use the most standard language code (`he`) and its single representation per 
form.
  >
  > Allowing people to append an integer number to the item ID adds a second 
way to make up a code, and one that seems less useful to me: without knowing 
what the number means, how do I know which form representation to use? To me 
this runs counter to the goal of “allow[ing] the consumer to choose which 
variant they prefer”. For the languages that appear to need multiple 
representations for the same language code per form (e.g. the Indian languages 
@Mahir256 mentioned in T236593#5608530 
<https://phabricator.wikimedia.org/T236593#5608530>?), is it not possible to 
make the item ID approach work, by creating more special-purpose items? 
Wikidata editors would then make a decision which of the possible spellings 
“deserves” the standard language code, and which additional items need to be 
created (“spelling with character X”, “spelling with sequence Y”?). I 
understand that not all languages have standardized spellings where you can use 
a single item ID to refer to the spelling variants of a wide range of lexemes 
(like in Hebrew), but I think it should still be possible to describe different 
spellings using items that carry more meaning than just a number.
  
  As an English example, some religious people refuse to write the name "God" 
out in full, as they hold that doing so would constitute idolatry. We can tag 
such a spelling as en-x-Qnnnn, where Qnnnn refers to the religious group, but 
there is more than one alternative way to write "God": "G-d", "G*d", "G_d", 
"G-o-d", and so on. Whether a hyphen or an underscore is used makes no 
contextual difference, and the exact symbol that replaces the original letter 
affects neither pronunciation nor the religious connection. Hence all of these 
alternatives should be tagged en-x-Qnnnn, and with the patch it would be 
possible to have "en-x-Qnnnn-1" for "G-d" and "en-x-Qnnnn-2" for "G*d". I 
can't see how more specific labels would help differentiate "G-d" from "G*d".
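
  To make the tagging scheme concrete, here is a rough Python sketch of how a 
data reuser might read such codes. It is only an illustration, not the actual 
Wikibase implementation: the names `SpellingCode`, `parse_spelling_code` and 
`representations_for` are made up for this example, the numeric suffix follows 
the `-x-Q123-1` shape discussed in this task, and Q999999999 is a placeholder 
standing in for the unspecified Qnnnn item.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class SpellingCode(NamedTuple):
    """Parts of a representation code such as 'en-x-Q999999999-1' (hypothetical type)."""
    base: str             # established language code, e.g. 'en' or 'he'
    item: Optional[str]   # item ID after '-x-', or None for the plain code
    index: Optional[int]  # numeric suffix proposed by the patch, or None


# Assumed shape of the private-use part: an item ID, optionally followed by '-<number>'.
_PRIVATE_RE = re.compile(r"(Q[1-9][0-9]*)(?:-([1-9][0-9]*))?")


def parse_spelling_code(code: str) -> SpellingCode:
    """Split a code into base language, item ID and optional numeric suffix."""
    base, sep, private = code.partition("-x-")
    if not sep:
        return SpellingCode(base, None, None)
    match = _PRIVATE_RE.fullmatch(private)
    if match is None:
        raise ValueError(f"unsupported private-use suffix in {code!r}")
    index = int(match.group(2)) if match.group(2) else None
    return SpellingCode(base, match.group(1), index)


def representations_for(reps: Dict[str, str], base: str, item: Optional[str]) -> List[str]:
    """Return all representations tagged with this base language and item, in suffix order."""
    hits = []
    for code, text in reps.items():
        parsed = parse_spelling_code(code)
        if parsed.base == base and parsed.item == item:
            hits.append((parsed.index or 0, text))
    return [text for _, text in sorted(hits)]


# Placeholder item ID standing in for the unspecified 'Qnnnn' above.
PLACEHOLDER_ITEM = "Q999999999"

form = {
    "en": "God",
    f"en-x-{PLACEHOLDER_ITEM}-1": "G-d",
    f"en-x-{PLACEHOLDER_ITEM}-2": "G*d",
}

print(representations_for(form, "en", PLACEHOLDER_ITEM))  # ['G-d', 'G*d']
```

  In this sketch a consumer who asks for the item gets every spelling that 
shares it, and the number only keeps the codes distinct; a consumer who asks 
for plain `en` gets the single standard representation.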

TASK DETAIL
  https://phabricator.wikimedia.org/T236593
