C933103 added a comment.
In T236593#8092471 <https://phabricator.wikimedia.org/T236593#8092471>, @LucasWerkmeister wrote:

> It’s still not clear to me which problem the `-x-Q123-1` patch is trying to solve. Several languages have been mentioned in this task, but which of them would benefit from this system? I feel like for several of them, we’ve already reached the conclusion that separate forms are in fact the way to go.
>
> I’d like to extract a general rule from @Fnielsen’s comment above (T236593#5610903 <https://phabricator.wikimedia.org/T236593#5610903>): if you need separate statements, then you need separate forms or lexemes. (I think this is a sufficient condition, though it might not be a necessary one.) Pronunciation (whether pronunciation audio <https://www.wikidata.org/wiki/Property:P443> or IPA transcription <https://www.wikidata.org/wiki/Property:P898>) is probably the most significant kind of statement here: if a speaker would pronounce the spellings differently, then they should be different forms – regardless of whether the difference is a completely different ending as in octopuses/octopi, or just an extra schwa as in aft(e)nen. (I don’t find the hyphenation example as convincing… don’t you need a different hyphenation for every spelling variant, even for cases that really should just be multiple representations of one form? E.g. co‧lor/co‧lour – that could just be multiple statements on the same form, with different monolingual text language codes.)
>
> I suspect this rule covers the Norwegian example that originally motivated this task: I feel like “parametere” and “parametre” are probably pronounced differently, much like “aftnen” and “aftenen” are pronounced differently in Danish according to Finn. For Vietnamese chữ Nôm, I feel like @mxn’s comment at T236593#8024999 <https://phabricator.wikimedia.org/T236593#8024999> goes in a similar direction, though I admit I find the whole Chinese-characters part of this discussion hard to follow.
>
> For the cases where you really only want to have one form with multiple representations, I still agree with @daniel’s comment (T236593#5610378 <https://phabricator.wikimedia.org/T236593#5610378>): “you make up a code for each of the spellings”. In practice, the only way to “make up a code” that we currently support is to append -x-Q//12345// to an existing, established language code. As far as I understand, this solution works well for Hebrew: e.g. ספר/סֵפֶר (L67105) <https://www.wikidata.org/wiki/Lexeme:L67105> (the “book” word) uses the language codes `he` and `he-x-Q21283070`, where Q21283070 <https://www.wikidata.org/wiki/Q21283070> represents Tiberian vocalization, the orthography with diacritics. At some point, an editorial decision was made that the spelling without diacritics “deserves” the unsuffixed `he` language code (instead of both spellings using an -x-Q//12345// language code), which I think is reasonable: data reusers who don’t care about the different spellings can use the most standard language code (`he`) and its single representation per form.
>
> Allowing people to append an integer number to the item ID adds a second way to make up a code, and one that seems less useful to me: without knowing what the number means, how do I know which form representation to use? To me this runs counter to the goal of “allow[ing] the consumer to choose which variant they prefer”. For the languages that appear to need multiple representations for the same language code per form (e.g. the Indian languages @Mahir256 mentioned in T236593#5608530 <https://phabricator.wikimedia.org/T236593#5608530>?), is it not possible to make the item ID approach work, by creating more special-purpose items? Wikidata editors would then make a decision which of the possible spellings “deserves” the standard language code, and which additional items need to be created (“spelling with character X”, “spelling with sequence Y”?).
> I understand that not all languages have standardized spellings where you can use a single item ID to refer to the spelling variants of a wide range of lexemes (like in Hebrew), but I think it should still be possible to describe different spellings using items that carry more meaning than just a number.

As an English example, some religious people refuse to write the name "God" out in full, as this would constitute idolatry. For this we can tag the spelling as en-x-Qnnnn, where Qnnnn refers to the religious group in question, but there is more than one alternative way to write "God": "G-d", "G*d", "G_d", "G-o-d", and so on. Whether a hyphen or an underscore is used makes no contextual difference, and the choice of symbol substituted for the original letter affects neither the pronunciation nor the religious connection. Hence all of these alternatives should be tagged en-x-Qnnnn, and with the patch it would be possible to have "en-x-Qnnnn-1" for "G-d" and "en-x-Qnnnn-2" for "G*d". I can't see how more specific labels would be useful in differentiating "G-d" from "G*d".

TASK DETAIL
  https://phabricator.wikimedia.org/T236593

To: C933103
Cc: LucasWerkmeister, C933103, AGutman-WMF, mxn, So9q, Ijon, daniel, Asaf, Mahir256, Danmichaelo, Fnielsen, Lucas_Werkmeister_WMDE, Denny, Lydia_Pintscher, jeblad, jhsoby, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331