thiemowmde added a comment.
Estimated table sizes:
wbl_lexemes
The latest Item ID is currently Q49977198. That's 9 bytes.
9 * 3 = 27 bytes per row.
27 * 1 million Lexemes = 26 megabytes.
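The arithmetic above can be checked with a short sketch. It assumes the factor 3 stands for three ID-sized columns per row, and it ignores per-row storage overhead and indexes; the variable names are only illustrative.

```python
# Back-of-the-envelope check of the wbl_lexemes size estimate.
# Assumption: each row holds three IDs of roughly the same length.
ID_BYTES = len("Q49977198")   # 9 bytes for the latest Item ID
IDS_PER_ROW = 3               # assumed meaning of the factor 3
LEXEME_COUNT = 1_000_000      # the 1 million Lexemes used in the estimate

row_bytes = ID_BYTES * IDS_PER_ROW
table_bytes = row_bytes * LEXEME_COUNT
table_mib = table_bytes / 1024**2

print(f"{row_bytes} bytes per row")
print(f"{table_bytes} bytes in total, about {table_mib:.1f} MiB")
```

Rounded up, that is the 26 megabytes quoted above when a megabyte is counted as 1024 * 1024 bytes.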
wbl_lemmata
Lexeme IDs will be similar to Item IDs, so 9 bytes again.
Let's say language codes are 5 bytes on a
thiemowmde added a comment.
wb_terms is plural. Most MediaWiki core tables are plural. I also like plural names for tables more. But in the end it really does not matter.
I used VARBINARY and VARCHAR BINARY as they currently are on other Wikibase tables. From https://dev.mysql.com/doc/refman/5.7/e
Lucas_Werkmeister_WMDE added a comment.
(Minor comment – the MediaWiki database coding conventions prefer singular table names, i.e. wbl_lexeme and wbl_lemma. But I don’t know if there’s a different convention within Wikibase.)
(Also, is the use of VARCHAR BINARY instead of VARBINARY for lem_lemm
thiemowmde added a comment.
@WMDE-leszek, something like this would be my draft:
CREATE TABLE IF NOT EXISTS wbl_lexemes (
    lex_lexeme_id           VARBINARY(20) NOT NULL PRIMARY KEY,
    lex_lexical_category_id VARBINARY(20) NOT NULL,
    lex_language_item_id    VARBINARY(20) NOT NULL
);
CREATE TABLE IF NOT EXIS
Ladsgroup added a comment.
Our idea would be to have it as memcached.
Task detail: https://phabricator.wikimedia.org/T187775
thiemowmde added a comment.
Personally, I'm totally fine with using any kind of cache, be it an in-memory one or something else. My worst-case scenario is as follows: Let's say we have 10 million Lexemes, 2 lemmas per Lexeme, 20 bytes per lemma. The cache would need to hold about 0.4 gigabyte
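As a rough check of this worst-case figure, here is a minimal sketch that counts only the raw lemma bytes. Cache keys and per-entry overhead are not part of the estimate and are left out; the variable names are only illustrative.

```python
# Worst-case cache size for lemmas, using the numbers from the comment above.
LEXEME_COUNT = 10_000_000   # 10 million Lexemes
LEMMAS_PER_LEXEME = 2       # 2 lemmas per Lexeme
BYTES_PER_LEMMA = 20        # 20 bytes per lemma

cache_bytes = LEXEME_COUNT * LEMMAS_PER_LEXEME * BYTES_PER_LEMMA
cache_gb = cache_bytes / 10**9  # decimal gigabytes

print(f"{cache_bytes} bytes, about {cache_gb} GB")
```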
WMDE-leszek added a comment.
I just had a chat with @Ladsgroup and he suggested the following regarding the wbl_lemmas table: what about not putting this stuff in the database table but storing all lemmas for display in the cache (or caching them when they're used)? I am a bit ignorant, but as wbl_lemmas i
WMDE-leszek added a comment.
Regarding the number of lemmas per lexeme, @Lucas_Werkmeister_WMDE makes a good point. As far as I remember, @thiemowmde and I talked in person last week about the number there, and we said something like: the security guesstimate would be to say the total number of lemmas wo
thiemowmde added a comment.
We should fix https://commons.wikimedia.org/wiki/File:Lexeme_data_model.png then, because it very prominently says there is only "one" lemma. It could be this is meant to be interpreted as "one" value that can somehow contain multiple values. I wonder what the benefit of
WMDE-leszek added a comment.
One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters).
General note: English is probably not the best language to look for in the context of long words (even German beats it easily).
In contrast to Item labels, the l
daniel added a comment.
On 20.02.2018 at 15:44, Lucas_Werkmeister_WMDE wrote:
Lucas_Werkmeister_WMDE added a comment.
There is only one lemma per Lexeme (in only one language)
Don’t we have something to support e. g. “color” and “colour” for the same
lexeme? I’m not sure if that’s two lemmas o
Lucas_Werkmeister_WMDE added a comment.
There is only one lemma per Lexeme (in only one language)
Don’t we have something to support e. g. “color” and “colour” for the same lexeme? I’m not sure if that’s two lemmas or one lemma (multilingual text) with two spellings, but there seems to be some nee