[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

mrephabricator Thu, 04 Aug 2022 00:43:55 -0700

mrephabricator added a comment.


  This may be verging on pedantry, but I will say that the principle of "one 
form per combination of grammatical features" does not sound broadly applicable 
enough to follow for each language. Maybe I am missing something and this is 
just a convention for certain languages.
  
  In any case, here are some examples which illustrate where this would not be 
a helpful model. In Punjabi, an alternate form with identical grammatical 
features could represent any combination of the following:
  
  - An alternative pronunciation of the same form, represented by mutual 
"alternative form" property links without mutual "homophone form" links
  - An alternative spelling of the same form in any or all of the spelling 
variants/orthographies represented, represented by mutual "alternative form" 
links and mutual "homophone form" links.
    - If the spelling varies in only one representation (which is actually 
not as common as I initially expected), the other representation(s) are 
duplicated exactly. This may seem somewhat tedious, but for the time being it 
is an effective way to record the useful information that, where spelling 
varies in one writing system, only one spelling is accepted in the other.
  - Dialectal or regional variants of the same form, most often simply 
indicated with "variety of form" set to "unknown value," as usually no 
empirical evidence exists to assign the form to a specific named dialect or say 
anything more specific than "this form will vary depending on who you talk to."
  - Shortened or contracted variants of the same form, indicated with mutual 
"alternative form" property links and "short form" as a grammatical feature on 
the shorter form.
  - Versions of forms which are only for use in spoken language / dialogue as 
opposed to versions of forms which are only used in writing. For example, for 
some forms on a Punjabi verb, the form will get inflected twice for grammatical 
number and/or person, once on an infixed part of the form, and once on the 
suffixed ending of the form, but in spoken/colloquial language it is acceptable 
to use a form which is only inflected once.
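
  As a rough sketch of how the link conventions above could be read 
mechanically: the class, field names and labels below are made up for 
illustration and are not the Wikibase data model, and links are stored as 
plain sets of form ids.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    # Hypothetical stand-in for a lexeme form; not the Wikibase data model.
    form_id: str
    features: frozenset
    alternative_forms: set = field(default_factory=set)  # ids linked via "alternative form"
    homophone_forms: set = field(default_factory=set)    # ids linked via "homophone form"


def mutual(a, b, attr):
    """True when each form links to the other through the given link set."""
    return b.form_id in getattr(a, attr) and a.form_id in getattr(b, attr)


def classify(a, b):
    """Label how two forms relate, following the conventions in the list above."""
    if not mutual(a, b, "alternative_forms"):
        return "unrelated"
    # "short form" sits on the shorter form only, so it is the sole difference.
    if (a.features ^ b.features) == {"short form"}:
        return "contraction"
    if a.features != b.features:
        return "other"
    if mutual(a, b, "homophone_forms"):
        return "alternative spelling"
    return "alternative pronunciation"
```

  Dialectal variants and spoken-only variants are left out of the sketch, 
since they are marked by other means ("variety of form" and the like) rather 
than by these two link types alone.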
  
  Notably, all of the above will only apply to particular inflections of a 
given lexeme. If we take this verb for example, 
https://www.wikidata.org/wiki/Lexeme:L688582 , 30 of the 99 forms documented 
so far are "alternate forms," sharing their grammatical features with another 
form. If we were to create 30 separate lexemes to represent 
this 1 word, how would we represent the rest of the context that is important 
for understanding what these inflections represent, or indicate for example 
that ਹਸਾਏਂਗੀ and ਹਸਾਵੇਂਗੀ are interchangeable spelling + pronunciation options 
for second person + feminine + singular + additive + causative + subjunctive + 
definite, but that only ਹਸਾਵਾਂਗੀ is acceptable as a spelling + pronunciation 
option for first person + feminine + singular + additive + causative + 
subjunctive + definite? On other lexemes, the same grammatical feature 
combination may permit variation. (Whether it does is ultimately governed by 
the final phoneme of the verb root, and this only ever applies to the 
gender-inflected, written/formal first person subjunctive definite forms.) 
Splitting the lexeme up in that way would be an unsustainable model. I am 
relatively conservative about what constitutes a 
separate lexeme; I tend to base it primarily on a combination of part of speech 
+ mode of derivation rather than pronunciation or spelling variation, 
especially since the latter factors generally don't have any bearing on how and 
where a lexeme can be used according to the internal logic of the language.
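
  To make the example concrete, here is a minimal sketch of keeping all three 
forms on one lexeme and grouping them by feature set. The tuple layout and the 
helper name `group_by_features` are assumptions for illustration only; the 
feature labels are the ones given above.

```python
from collections import defaultdict

# Features shared by all three example forms.
COMMON = {"feminine", "singular", "additive", "causative", "subjunctive", "definite"}

# (representation, grammatical features) pairs living on a single lexeme.
forms = [
    ("ਹਸਾਏਂਗੀ", frozenset(COMMON | {"second person"})),
    ("ਹਸਾਵੇਂਗੀ", frozenset(COMMON | {"second person"})),
    ("ਹਸਾਵਾਂਗੀ", frozenset(COMMON | {"first person"})),
]


def group_by_features(forms):
    """Map each feature set to the list of representations that carry it."""
    groups = defaultdict(list)
    for representation, features in forms:
        groups[features].append(representation)
    return dict(groups)


groups = group_by_features(forms)

# Feature sets with more than one form are the ones that permit variation.
variants = {feats: reps for feats, reps in groups.items() if len(reps) > 1}
```

  Here `variants` holds only the second person feature set, with both 
interchangeable forms, while the first person feature set maps to a single 
form. That contrast is only visible because the forms stay on the same lexeme.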
  
  I am inclined to agree that the specific purpose of the numbered Q-item 
language code patch is hard to discern. I think what may be the case here is 
that each of the concerns brought up in this thread has a different solution. 
Theoretically, there is no upper limit on the number of variations a form can 
have, and it could become confusing if languages started to have long vertical 
strips of representations, some of which are governed by a consistent 
heuristic, and some of which are arbitrary. What may be productive is the 
addition of various properties for use on lexeme forms which offer more nuanced 
ways to model the different languages discussed here.

TASK DETAIL
  https://phabricator.wikimedia.org/T236593

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mrephabricator
Cc: mrephabricator, LucasWerkmeister, C933103, AGutman-WMF, mxn, So9q, Ijon, 
daniel, Asaf, Mahir256, Danmichaelo, Fnielsen, Lucas_Werkmeister_WMDE, Denny, 
Lydia_Pintscher, jeblad, jhsoby, Astuthiodit_1, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, 
Mbch331

