MPhamWMF added a comment.

  > potential upper limit of how many Lexemes there could be
  
  My earlier napkin math on this:
  
  > For the record, as hard as it is to quantify words/lemmas/lexemes 
cross-linguistically, I think one needs to know on the order of 10^4 words in a 
language as a typical speaker, with the upper bound of that on the order of 
10^5 [1]. There are ~7k living languages in the world today (~300 Wikipedias), 
so 10^5 x 10^3 = 10^8 is the very generous upper bound for lexemes according to 
my count, 10^4 x 10^2 = 10^6 on the more realistic end of total wiki coverage — 
Trey can correct my napkin math sweat 😅. Anyway, 10^8 I think is the order of 
magnitude of total Wikidata items currently (including lexemes), so there is 
some potential for lexemes to comprise a reasonably sized subgraph if people 
use it.
  
  with Trey's follow up:
  
  > As for Mike's napkin math, it seems to be in the right ballpark for 
sure—but... (many) lexicographers are by their nature completionists, so I 
could see more than 100K lexemes for languages with a very well-established 
lexicographic tradition (i.e., any major world or regional language). English 
Wiktionary has more than 350K nouns, 130K adjectives, and 44K verbs—so > 500K 
total (plus 77K proper nouns; not sure what to make of that for Wikidata 
Lexemes). I expect we'll only hit that kind of volume for dozens of languages, 
though, at least in the early days.
  > I'm more concerned about forms. (I'm not too worried about senses because I 
think the average number of senses for words is < 2, though "set" may have as 
many as 150 in the OED but only ~90 in Wiktionary.) Anyway, back to forms; verbs in 
Romance languages can have ~50 forms. Finnish nouns have ~2200 forms. (OTOH, I 
checked a couple of Finnish lexemes and they have ~10 forms on them.) Not sure 
how forms are represented internally, but there can certainly be a lot of them 
for any given Lexeme.
  >  What happens to the size of the Lexeme subgraph if we assume, say, 20 
forms on average for every Lexeme?
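
  To put rough numbers on that last question, here is a minimal sketch that 
  multiplies out the estimates quoted above. Every input is one of the thread's 
  order-of-magnitude guesses, not a measured value, and the constant and 
  function names are made up for illustration. It counts forms only; how many 
  triples each form turns into depends on how forms are represented 
  internally, which the thread leaves open.

```python
# Napkin-math sketch: lexeme and form counts from the thread's rough estimates.
# All constants are order-of-magnitude guesses quoted in the comment above,
# not measurements. Names are hypothetical and exist only for this example.

LEXEMES_PER_LANGUAGE_TYPICAL = 10**4  # words a typical speaker knows
LEXEMES_PER_LANGUAGE_UPPER = 10**5    # per-language upper bound
LANGUAGES_REALISTIC = 10**2           # order of the ~300 Wikipedias
LANGUAGES_UPPER = 10**3               # factor the comment pairs with ~7k living languages
AVG_FORMS_PER_LEXEME = 20             # the hypothetical average asked about


def estimate(lexemes_per_language: int, languages: int, forms_per_lexeme: int) -> dict:
    """Return total lexemes and total forms for one scenario."""
    lexemes = lexemes_per_language * languages
    forms = lexemes * forms_per_lexeme
    return {"lexemes": lexemes, "forms": forms}


realistic = estimate(LEXEMES_PER_LANGUAGE_TYPICAL, LANGUAGES_REALISTIC, AVG_FORMS_PER_LEXEME)
upper = estimate(LEXEMES_PER_LANGUAGE_UPPER, LANGUAGES_UPPER, AVG_FORMS_PER_LEXEME)

for label, scenario in (("realistic", realistic), ("upper bound", upper)):
    print(f"{label}: {scenario['lexemes']:.0e} lexemes, {scenario['forms']:.0e} forms")
```

  Under these assumptions, 20 forms per Lexeme gives about 2 x 10^7 forms on 
  the realistic end and about 2 x 10^9 at the generous upper bound, so forms 
  rather than Lexemes would dominate the count.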

TASK DETAIL
  https://phabricator.wikimedia.org/T275068
