[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-05-12 Thread Gehel
Gehel closed this task as "Resolved".

TASK DETAIL
  https://phabricator.wikimedia.org/T275068

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dcausse, Gehel
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-04-22 Thread MPhamWMF
MPhamWMF updated the task description.

To: dcausse, MPhamWMF
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-04-21 Thread dcausse
dcausse added a comment.


  In T275068#7021725 , 
@MPhamWMF wrote:
  
  > Thanks, @dcausse!
  > Do you know what percentage of total queries 529097 and 357917 are? I hear 
you on not trusting these numbers, and I think ballparking is fine for now.
  
  Sorry, I just realized my numbers were completely off (the job was scanning 
the whole dataset, not just one day...).
  
  So out of 225,359,379 queries for March 2021, the simple pattern detected 
206,612 queries involving lexemes (~0.09%).
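  
  As a quick check of the percentage above, here is a minimal sketch that 
recomputes it from the two counts quoted in this comment. The function and 
constant names are illustrative only and are not taken from the original 
analysis.

```python
def share_percent(matching: int, total: int) -> float:
    """Return matching/total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matching / total


# Counts quoted in the comment above (March 2021).
TOTAL_QUERIES_MARCH_2021 = 225_359_379
LEXEME_QUERIES_MARCH_2021 = 206_612

pct = share_percent(LEXEME_QUERIES_MARCH_2021, TOTAL_QUERIES_MARCH_2021)
print(f"{pct:.2f}% of queries matched the lexeme pattern")
```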

To: dcausse
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-04-20 Thread MPhamWMF
MPhamWMF added a comment.


  Thanks, @dcausse!
  Do you know what percentage of total queries 529097 and 357917 are? I hear 
you on not trusting these numbers, and I think ballparking is fine for now.

To: dcausse, MPhamWMF
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-04-20 Thread dcausse
dcausse moved this task from In Progress to Needs review on the 
Discovery-Search (Current work) board.
dcausse added a comment.


  > percentage, number of WDQS queries per month that involve Lexemes
  >
  >> percentage, number of the above queries that only involve Lexemes (i.e. 
doesn't require anything from the larger Wikidata graph)
  
  With very naive heuristics and for one day I extracted 529097 queries 
involving lexemes.
  357917 seemed to require data from Wikidata, but I would not trust this too 
much. Since the language is a Wikidata item, a query requesting labels in a 
language using its language code rather than its QID falls into the category of 
queries requiring the Wikidata graph.
  I did not run the analysis on the full month because it's rather slow, and 
given the precision of the heuristics I chose I would not trust these numbers 
anyway.
  
  If we need more precise numbers the analysis will have to be more involved.
  
  For ref, here is the list of predicates I used to detect a `lexeme` query: 
`wikibase:lemma`, `ontolex:lexicalForm`, `ontolex:representation`, 
`ontolex:LexicalEntry`, `ontolex:sense`, `dct:language`, 
`wikibase:lexicalCategory`, `wikibase:grammaticalFeature`.
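  
  To illustrate what such a naive heuristic might look like, here is a 
minimal sketch that flags a SPARQL query as a lexeme query when its text 
contains any of the prefixed names listed above. This is an assumption about 
the general approach, not the code used for the analysis: the function name 
`looks_like_lexeme_query` is hypothetical, and plain substring matching will 
miss queries that write the same terms as full IRIs.

```python
# Prefixed names listed in the comment above.
LEXEME_PREDICATES = (
    "wikibase:lemma",
    "ontolex:lexicalForm",
    "ontolex:representation",
    "ontolex:LexicalEntry",
    "ontolex:sense",
    "dct:language",
    "wikibase:lexicalCategory",
    "wikibase:grammaticalFeature",
)


def looks_like_lexeme_query(sparql: str) -> bool:
    """Naive check: does the query text mention any lexeme-related term?"""
    return any(term in sparql for term in LEXEME_PREDICATES)


lexeme_query = "SELECT ?l ?lemma WHERE { ?l wikibase:lemma ?lemma } LIMIT 10"
item_query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10"

print(looks_like_lexeme_query(lexeme_query))
print(looks_like_lexeme_query(item_query))
```

  A heuristic like this only says whether lexeme vocabulary appears in the 
query; deciding whether the query also needs the rest of the Wikidata graph 
would take a more involved analysis, as noted above.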
  
  > given the current rate of growth of Wikidata, approximately how much time 
it would take for non-Lexeme Wikidata to grow back to its current size
  
  The lexemes RDF dataset is about 77M triples (0.6% of the total size of the 
graph).
  If we were to remove lexemes from the main graph, at the current growth rate 
it would take ~10 days for Wikidata to grow back to the equivalent size.
  Note that in the current graph "only" 29316 distinct Wikidata items are being 
referenced from the lexemes.
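  
  The figures above imply a total graph size and a daily growth rate that are 
not stated explicitly. A minimal sketch of that arithmetic follows; the derived 
values are rough back-calculations from the rounded numbers in this comment 
(77M triples, 0.6%, ~10 days), not measurements.

```python
# Rounded figures quoted in the comment above.
LEXEME_TRIPLES = 77_000_000   # size of the lexemes RDF dataset
LEXEME_SHARE = 0.006          # 0.6% of the whole graph
REGROWTH_DAYS = 10            # time for the graph to grow back by that amount

# Implied total graph size: lexeme triples divided by their share.
total_triples = LEXEME_TRIPLES / LEXEME_SHARE

# Implied growth rate: the graph regains the lexeme-sized chunk in ~10 days.
daily_growth = LEXEME_TRIPLES / REGROWTH_DAYS

print(f"implied total graph size: ~{total_triples / 1e9:.1f}B triples")
print(f"implied growth rate: ~{daily_growth / 1e6:.1f}M triples/day")
```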

WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/

To: dcausse
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-04-19 Thread dcausse
dcausse claimed this task.

To: dcausse
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-03-01 Thread MPhamWMF
MPhamWMF set the point value for this task to "3".

To: MPhamWMF
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, abian, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-22 Thread MPhamWMF
MPhamWMF moved this task from All WDQS-related tasks to Current work on the 
Wikidata-Query-Service board.
MPhamWMF added a project: Discovery-Search (Current work).

WORKBOARD
  https://phabricator.wikimedia.org/project/board/891/

To: MPhamWMF
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-18 Thread Lydia_Pintscher
Lydia_Pintscher added subscribers: DVrandecic, Lydia_Pintscher.
Lydia_Pintscher added a comment.


  Some additional potentially helpful stats:
  
  - https://grafana.wikimedia.org/d/UHV96YJGk/wikidata-datamodel-lexemes?orgId=1=1d
  - https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Statistics

To: Lydia_Pintscher
Cc: Lydia_Pintscher, DVrandecic, Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, 
CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-18 Thread MPhamWMF
MPhamWMF updated the task description.

To: MPhamWMF
Cc: Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-18 Thread MPhamWMF
MPhamWMF added a comment.


  > potential upper limit of how many Lexemes there could be
  
  My earlier napkin math on this:
  
  > For the record, as hard as it is to quantify words/lemmas/lexemes 
cross-linguistically, I think one needs to know on the order of 10^4 words in a 
language as a typical speaker, with the upper bound of that on the order of 
10^5 [1]. There are ~7k living languages in the world today (~300 Wikipedias), 
so 10^5 x 10^3 = 10^8 is the very generous upper bound for lexemes according to 
my count, and 10^4 x 10^2 = 10^6 is on the more realistic end of total wiki 
coverage (Trey can correct my napkin math). Anyway, 10^8 I think is the order 
of magnitude of total Wikidata items currently (including lexemes), so there is 
some potential for lexemes to comprise a reasonably sized subgraph if people 
use it.
  
  with Trey's follow up:
  
  > As for Mike's napkin math, it seems to be in the right ballpark for 
sure—but... (many) lexicographers are by their nature completionists, so I 
could see more than 100K lexemes for languages with a very well-established 
lexicographic tradition (i.e., any major world or regional language). English 
Wiktionary has more than 350K nouns, 130K adjectives, and 44K verbs—so > 500K 
total (plus 77K proper nouns.. not sure what to make of that for Wikidata 
Lexemes). I expect we'll only hit that kind of volume for dozens of languages, 
though, at least in the early days.
  > I'm more concerned about forms. (I'm not too worried about senses because I 
think the average number of senses per word is < 2, though "set" may have as 
many as 150 in the OED, but only ~90 in Wiktionary.) Anyway, back to forms: 
verbs in Romance languages can have ~50 forms. Finnish nouns have ~2200 forms. 
(OTOH, I checked a couple of Finnish lexemes and they have ~10 forms on them.) 
Not sure how forms are represented internally, but there can certainly be a lot 
of them for any given Lexeme.
  > What happens to the size of the Lexeme subgraph if we assume, say, 20 
forms on average for every Lexeme?
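  
  One way to put numbers on the closing question is to multiply the two 
lexeme-count scenarios from the napkin math by the assumed 20 forms per lexeme. 
The sketch below does only that; the scenario labels and the function name 
`total_forms` are illustrative, and it says nothing about how many triples each 
form would add, since the thread does not state that.

```python
def total_forms(lexemes: int, forms_per_lexeme: int) -> int:
    """Number of Form entities given an average number of forms per lexeme."""
    return lexemes * forms_per_lexeme


# Lexeme-count scenarios from the napkin math quoted above.
SCENARIOS = {
    "realistic (10^4 x 10^2)": 10**6,
    "generous upper bound (10^5 x 10^3)": 10**8,
}

# Average suggested in the question above; an assumption, not a measurement.
ASSUMED_FORMS_PER_LEXEME = 20

for label, lexemes in SCENARIOS.items():
    forms = total_forms(lexemes, ASSUMED_FORMS_PER_LEXEME)
    print(f"{label}: {lexemes:,} lexemes -> {forms:,} forms")
```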

To: MPhamWMF
Cc: Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-18 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  > percentage, number of Wikidata entities that are Lexemes
  
  Wikidata Datamodel currently has 92212288 Items, 404946 Lexemes, and 8450 
Properties, so that would be 0.4% Lexemes across all “top-level” entities. If 
we include Forms and Senses in the count (7053785 and 90003, respectively), all 
lexicographical entities make up 7.6% of all entities.
  
  > percentage, number of Lexemes that are connected to non-Lexeme items in WD
  
  Every Lexeme is connected to at least two Items, its language and lexical 
category. Additionally, Forms typically have several grammatical feature Items.
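  
  For reference, a minimal sketch that reproduces the two percentages from the 
entity counts quoted in this comment. The variable names are illustrative; the 
forms-per-lexeme ratio at the end is derived from the same counts and is not 
stated in the comment itself.

```python
# Entity counts quoted in the comment above.
ITEMS = 92_212_288
LEXEMES = 404_946
PROPERTIES = 8_450
FORMS = 7_053_785
SENSES = 90_003

# Share of Lexemes among "top-level" entities (Items, Lexemes, Properties).
top_level = ITEMS + LEXEMES + PROPERTIES
lexeme_share = 100 * LEXEMES / top_level

# Share of all lexicographical entities once Forms and Senses are counted too.
all_entities = top_level + FORMS + SENSES
lexicographical_share = 100 * (LEXEMES + FORMS + SENSES) / all_entities

# Derived: average number of Forms per Lexeme at the time of the comment.
forms_per_lexeme = FORMS / LEXEMES

print(f"Lexemes among top-level entities: {lexeme_share:.1f}%")
print(f"Lexicographical entities among all entities: {lexicographical_share:.1f}%")
print(f"Average forms per lexeme: {forms_per_lexeme:.1f}")
```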

To: Lucas_Werkmeister_WMDE
Cc: Lucas_Werkmeister_WMDE, Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-18 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE updated the task description.

To: Lucas_Werkmeister_WMDE
Cc: Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-18 Thread MPhamWMF
MPhamWMF updated the task description.

To: MPhamWMF
Cc: Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-17 Thread MPhamWMF
MPhamWMF added a parent task: T274984: [Epic] Extract lexemes out of WDQS.

To: MPhamWMF
Cc: Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T275068: Get baseline measurements/expectations for splitting lexemes from Wikidata graph

2021-02-17 Thread MPhamWMF
MPhamWMF created this task.
MPhamWMF added projects: Wikidata-Query-Service, Wikidata.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  As a product manager for Wikidata and WDQS, I want to know what quantifiable 
benefits to service reliability and quality I might expect to gain (or lose) by 
splitting Lexemes out from the Wikidata graph, so that I can decide whether to 
move ahead with this plan and how to communicate it.
  
  In order to move ahead with splitting out Lexemes from WD, communicate this 
decision, and set expectations around the benefits of implementing this change, 
we should get some baseline measurements of the current state of Lexemes in 
Wikidata and WDQS, and estimates about the effects of splitting them off.
  
  AC:
  Get the numbers for the following metrics:
  
  - percentage, number of Wikidata items that are Lexemes
  - percentage, number of WDQS queries per month that involve Lexemes
    - percentage, number of the above queries that only involve Lexemes (i.e. 
don't require anything from the larger Wikidata graph)
  - given the current rate of growth of Wikidata, approximately how much time 
it would take for non-Lexeme Wikidata to grow back to its current size
  - potential upper limit of how many Lexemes there could be

To: MPhamWMF
Cc: Aklapper, MPhamWMF, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331