MPhamWMF added a comment.
> potential upper limit of how many Lexemes there could be

My earlier napkin math on this:

> For the record, as hard as it is to quantify words/lemmas/lexemes cross-linguistically, I think a typical speaker needs to know on the order of 10^4 words in a language, with an upper bound on the order of 10^5 [1]. There are ~7k living languages in the world today (~300 Wikipedias), so 10^5 x 10^3 = 10^8 is the very generous upper bound for lexemes by my count, and 10^4 x 10^2 = 10^6 is the more realistic end for total wiki coverage. Trey can correct my napkin math 😅. Anyway, I think 10^8 is the order of magnitude of total Wikidata items currently (including lexemes), so there is some potential for lexemes to comprise a reasonably sized subgraph if people use it.

with Trey's follow-up:

> As for Mike's napkin math, it seems to be in the right ballpark for sure, but... (many) lexicographers are by their nature completionists, so I could see more than 100K lexemes for languages with a very well-established lexicographic tradition (i.e., any major world or regional language). English Wiktionary has more than 350K nouns, 130K adjectives, and 44K verbs, so > 500K total (plus 77K proper nouns; not sure what to make of that for Wikidata Lexemes). I expect we'll only hit that kind of volume for dozens of languages, though, at least in the early days.
>
> I'm more concerned about forms. (I'm not too worried about senses because I think the average number of senses per word is < 2, though "set" may have as many as 150 in the OED, but only ~90 in Wiktionary.) Anyway, back to forms: verbs in Romance languages can have ~50 forms. Finnish nouns have ~2200 forms. (OTOH, I checked a couple of Finnish lexemes and they have ~10 forms on them.) Not sure how forms are represented internally, but there can certainly be a lot of them for any given Lexeme.
>
> What happens to the size of the Lexeme subgraph if we assume, say, 20 forms on average for every Lexeme?
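
To put rough numbers on that last question, here is a minimal sketch of the same napkin math in Python. The constants are just the order-of-magnitude figures quoted above, and the 20 forms per Lexeme is Trey's hypothetical average. The function name `estimate_subgraph` is made up for illustration. It only counts Lexemes and Forms as units; it does not model how forms are stored internally or how many triples each one produces, since that is not known from this thread.

```python
# Napkin-math sketch: all inputs are the rough figures quoted in this thread,
# not measurements of the actual Wikidata Lexeme subgraph.

WORDS_TYPICAL = 10**4      # words a typical speaker knows (order of magnitude)
WORDS_UPPER = 10**5        # generous upper bound per language
LANGS_REALISTIC = 10**2    # order of magnitude of ~300 Wikipedias
LANGS_UPPER = 10**3        # order of magnitude used above for living languages
AVG_FORMS_PER_LEXEME = 20  # hypothetical average from the follow-up question


def estimate_subgraph(words_per_language, languages, forms_per_lexeme):
    """Return (lexeme_count, form_count) for the given assumptions."""
    lexemes = words_per_language * languages
    forms = lexemes * forms_per_lexeme
    return lexemes, forms


realistic = estimate_subgraph(WORDS_TYPICAL, LANGS_REALISTIC, AVG_FORMS_PER_LEXEME)
upper = estimate_subgraph(WORDS_UPPER, LANGS_UPPER, AVG_FORMS_PER_LEXEME)

for label, (lexemes, forms) in (("realistic", realistic), ("upper bound", upper)):
    print(f"{label}: {lexemes:.0e} lexemes, {forms:.0e} forms")
```

Under these assumptions, 20 forms per Lexeme gives about 2 x 10^7 forms on the realistic end and 2 x 10^9 on the generous end, so forms would outnumber Lexemes by a bit more than one order of magnitude either way.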
TASK DETAIL
  https://phabricator.wikimedia.org/T275068