mrephabricator added a comment.
This may be verging on pedantry, but I will say that the principle of "one form per combination of grammatical features" does not sound broadly applicable enough to follow for every language. Maybe I am missing something and this is just a convention for certain languages. In any case, here are some examples which illustrate where this would not be a helpful model.

In Punjabi, an alternate form with identical grammatical features could represent any combination of the following:

- An alternative pronunciation of the same form, represented by mutual "alternative form" property links without mutual "homophone form" links.
- An alternative spelling of the same form in any or all of the spelling variants/orthographies represented, represented by mutual "alternative form" links and mutual "homophone form" links.
  - If the spelling varies only for one representation (which is actually not as common as I initially expected), the other representation(s) are duplicated exactly. This may seem somewhat tedious, but for the time being it is an effective way to store the useful information that where spelling varies in one writing system, only one spelling is accepted in the other.
- Dialectal or regional variants of the same form, most often simply indicated with "variety of form" set to "unknown value," as usually no empirical evidence exists to assign the form to a specific named dialect or to say anything more specific than "this form will vary depending on who you talk to."
- Shortened or contracted variants of the same form, indicated with mutual "alternative form" property links and "short form" as a grammatical feature on the shorter form.
- Versions of forms which are only for use in spoken language / dialogue, as opposed to versions of forms which are only used in writing.
For example, some forms of a Punjabi verb are inflected twice for grammatical number and/or person, once on an infixed part of the form and once on the suffixed ending, but in spoken/colloquial language it is acceptable to use a form which is only inflected once.

Notably, all of the above apply only to particular inflections of a given lexeme. If we take this verb for example, https://www.wikidata.org/wiki/Lexeme:L688582 , of the 99 forms documented so far, 30 have "alternate forms" that share grammatical features with another form. If we were to create 30 separate lexemes to represent this 1 word, how would we represent the rest of the context that is important for understanding what these inflections represent? How would we indicate, for example, that ਹਸਾਏਂਗੀ and ਹਸਾਵੇਂਗੀ are interchangeable spelling + pronunciation options for second person + feminine + singular + additive + causative + subjunctive + definite, but that only ਹਸਾਵਾਂਗੀ is acceptable as a spelling + pronunciation option for first person + feminine + singular + additive + causative + subjunctive + definite? On other lexemes, the same grammatical feature combination may permit variation. (This is ultimately governed by the final phoneme of the verb's root, a rule which only ever applies to the gender-inflected, written/formal first person subjunctive definite forms.) That would be an unsustainable model.

I am relatively conservative about what constitutes a separate lexeme; I tend to base it primarily on a combination of part of speech + mode of derivation rather than pronunciation or spelling variation, especially since the latter factors generally don't have any bearing on how and where a lexeme can be used according to the internal logic of the language.

I am inclined to agree that the specific purpose of the numbered Q-item language code patch is hard to discern.
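To make the ਹਸਾਏਂਗੀ / ਹਸਾਵੇਂਗੀ case above concrete, here is a rough Python sketch of one lexeme whose forms are grouped by their grammatical feature set. The `Form` class, its field names, and the `group_by_features` helper are hypothetical names made up for illustration; they are not the Wikibase Lexeme data model or any real API. Only the three representations and their feature labels come from the example above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    """Hypothetical stand-in for a lexeme form: one spelling plus its features."""
    representation: str
    features: frozenset


def group_by_features(forms):
    """Map each grammatical feature set to every representation that carries it."""
    groups = defaultdict(list)
    for form in forms:
        groups[form.features].append(form.representation)
    return dict(groups)


# Features shared by all three forms discussed in the comment.
SHARED = {"feminine", "singular", "additive", "causative", "subjunctive", "definite"}

forms = [
    Form("ਹਸਾਏਂਗੀ", frozenset(SHARED | {"second person"})),
    Form("ਹਸਾਵੇਂਗੀ", frozenset(SHARED | {"second person"})),
    Form("ਹਸਾਵਾਂਗੀ", frozenset(SHARED | {"first person"})),
]

groups = group_by_features(forms)

# One feature set may hold several interchangeable forms, another only one.
for features, representations in groups.items():
    print(sorted(features), representations)
```

Under these assumptions, the second person feature set holds two interchangeable forms while the first person set holds exactly one, and both stay within a single lexeme, so the surrounding context is not scattered across separate entries.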
I think what may be the case here is that each of the concerns brought up in this thread has a different solution. Theoretically, there is no upper limit on the number of variations a form can have, and it could become confusing if languages started to have long vertical strips of representations, some of which are governed by a consistent heuristic, and some of which are arbitrary. What may be productive is the addition of various properties for use on lexeme forms which offer more nuanced ways to model the different languages discussed here.

TASK DETAIL
https://phabricator.wikimedia.org/T236593