[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-03-08 Thread Gehel
Gehel closed this task as "Resolved".

TASK DETAIL
  https://phabricator.wikimedia.org/T355040

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dcausse, Gehel
Cc: Gehel, Aklapper, dcausse, Danny_Benjafield_WMDE, Isabelladantes1983, 
Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, Astuthiodit_1, 786, 
Biggs657, karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, 
Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, 
CucyNoiD, Nandana, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, 
Bsandipan, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, 
Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-03-04 Thread dcausse
dcausse added a comment.


  The final report is available at 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/WDQS_Graph_Split_Impact_Analysis



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-02-08 Thread dcausse
dcausse moved this task from In Progress to Needs review on the 
Discovery-Search (Current work) board.
dcausse added a comment.


  Draft report up at 
https://wikitech.wikimedia.org/wiki/User:DCausse/WDQS_Graph_Split_Impact_Analysis

WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-02-02 Thread dcausse
dcausse added a comment.


  WIP:
  
  - included the new 100k-query sample named `QUERY-Q4` from T349512 (a random 
sample that is representative of the query length and runtime)
  - the % of affected queries (deduplicated) per tool is (//sample// being the 
`QUERY-Q4` sample mentioned above) F41752511: image.png

  
  The above graph should be taken with a grain of salt, as the number of queries 
per data point varies a lot (86 queries for //Listeria// vs 85k for 
//random//). These numbers are still being reviewed, so no conclusions should be 
drawn yet, but we do not seem to obtain the same numbers as those originally 
found in Wikidata_Subgraph_Query_Analysis, where 2.5% of the total query count 
was identified as requiring scholarly articles.
  A more qualitative analysis is in progress:
  
  - analysis of the user agents to understand which use cases are mainly 
affected; preliminary results show that, for instance, a single UA is the cause 
of 50% of the affected queries
  - extract some SPARQL queries to start evaluating how federation could be 
applied/tested
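
  The per-tool percentage of affected queries (deduplicated) mentioned above 
could be computed along the following lines. This is only a minimal sketch: the 
`(tool, query, affected)` record layout and the function name are assumptions 
made for illustration, not the format used in the actual notebooks.

```python
from collections import defaultdict


def affected_share_per_tool(records):
    """Return {tool: % of distinct queries that are affected}.

    `records` is a hypothetical iterable of (tool, query, affected) tuples.
    A query seen several times for the same tool is counted once, and is
    considered affected if any of its occurrences was affected.
    """
    seen = {}
    for tool, query, affected in records:
        key = (tool, query)
        seen[key] = seen.get(key, False) or affected

    totals = defaultdict(int)
    hits = defaultdict(int)
    for (tool, _query), affected in seen.items():
        totals[tool] += 1
        hits[tool] += int(affected)

    return {tool: 100.0 * hits[tool] / totals[tool] for tool in totals}
```

  Because the number of distinct queries per tool varies a lot, it is probably 
worth reporting the denominator alongside each percentage.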



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-26 Thread dcausse
dcausse added a comment.


  WIP: 
https://people.wikimedia.org/~dcausse/T355040_EARLY_DRAFT_wdqs_query_results_analysis.html
 (UA redacted for now)
  
  TL;DR:
  
  - added support for identifying true positives (queries with a scientific 
article in the SPARQL query or in the results)
  - MixNMatch has a very high number of true positives and thus needs more 
qualitative analysis (ticket TBD)
  - Listeria does not have any true positives but shows a bad outcome (81% 
identical in the best case, 68% in the worst case) and needs more qualitative 
analysis too
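
  One possible way to flag true positives as defined above is to look for 
entity IDs of scholarly articles in the query text and in the result values. 
The sketch below assumes a precomputed set of scholarly article IDs 
(`scholarly_ids`, hypothetical) and is not necessarily how the draft analysis 
does it.

```python
import re

# Matches Wikidata item IDs such as Q42, whether written as wd:Q42
# or as part of a full entity URI.
_QID = re.compile(r"\bQ[1-9]\d*\b")


def entity_ids(text):
    """Return the set of item IDs mentioned in a piece of text."""
    return set(_QID.findall(text))


def is_true_positive(query, result_values, scholarly_ids):
    """True if the query or any result value mentions a scholarly article.

    `scholarly_ids` is an assumed, precomputed set of item IDs known to be
    scholarly articles; `result_values` is a flat iterable of strings.
    """
    mentioned = entity_ids(query)
    for value in result_values:
        mentioned |= entity_ids(value)
    return not mentioned.isdisjoint(scholarly_ids)
```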



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-26 Thread CodeReviewBot
CodeReviewBot added a project: Patch-For-Review.
CodeReviewBot added a comment.


  dcausse opened 
https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/1
  
  Draft: early draft of a comparison analysis



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-19 Thread dcausse
dcausse added a comment.


  Quick report on the progress being made:
  
  - Our query logs do not contain only SELECT queries, so the SPARQL client 
used to collect the data has to be adapted to support the other query forms 
(ASK, CONSTRUCT, 
DESCRIBE) (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/991622)
  - Getting failures due to response size: I bumped the limit to 16M but am 
still getting problems. I might stop here and simply tag & ignore such massive 
queries moving forward
  - Getting very bad numbers from Listeria and MixNMatch (34% and 17% identical 
respectively). The avg result size is 1.6k and 8k respectively, which might 
partly explain why getting identical results is difficult; more investigation 
is needed to understand the cause
  - Getting pretty mediocre numbers for WikidataIntegrator at 88% with a very 
small avg result size of 8; more investigation needed
  - Pywikibot and SPARQLWrapper are good at 99.4% for both
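
  To give an idea of what telling the query forms apart involves, here is a 
rough sketch that guesses the form of a query from its text. It is a 
regex-based heuristic written for illustration (function and constant names are 
made up), not the change made in the Gerrit patch above, and a real SPARQL 
parser would be more reliable.

```python
import re

# IRIs are removed first so that a '#' inside <http://...#> is not
# mistaken for the start of a comment.
_IRI = re.compile(r"<[^<>\s]*>")
_COMMENT = re.compile(r"#[^\n]*")
_FORM = re.compile(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def query_form(query):
    """Return 'SELECT', 'ASK', 'CONSTRUCT' or 'DESCRIBE', or None if unknown."""
    text = _IRI.sub(" ", query)
    text = _COMMENT.sub(" ", text)
    match = _FORM.search(text)
    return match.group(1).upper() if match else None
```

  The first keyword found after the prologue decides the form, which is why 
comments are stripped before searching.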



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-15 Thread Gehel
Gehel assigned this task to dcausse.



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-15 Thread Gehel
Gehel set the point value for this task to "8".



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-15 Thread Gehel
Gehel moved this task from Incoming to Current work on the 
Wikidata-Query-Service board.
Gehel added a project: Discovery-Search (Current work).

WORKBOARD
  https://phabricator.wikimedia.org/project/board/891/



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-15 Thread Gehel
Gehel removed a project: Wikidata-Query-Service.



[Wikidata-bugs] [Maniphest] T355040: Compare the results of sparql queries between the fullgraph and the subgraphs

2024-01-15 Thread dcausse
dcausse created this task.
dcausse added projects: Wikidata, Wikidata-Query-Service.

TASK DESCRIPTION
  By using a tool that compares two results of the same SPARQL query, we should 
evaluate how many queries might "break" when running against the Wikidata main 
graph instead of the full graph.
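
  As an illustration only (this is not the comparison tool from T351819), 
comparing two results of the same query could distinguish results that are 
strictly identical from results that contain the same rows in a different 
order. The sketch assumes each result is a list of rows, each row being a dict 
mapping a variable name to a string value; all names are hypothetical.

```python
from collections import Counter


def normalize(rows):
    """Turn a list of solution rows into an order-insensitive multiset."""
    return Counter(frozenset(row.items()) for row in rows)


def compare_results(full_graph_rows, main_graph_rows):
    """Classify two results as 'identical', 'reordered' or 'different'."""
    if full_graph_rows == main_graph_rows:
        return "identical"
    if normalize(full_graph_rows) == normalize(main_graph_rows):
        return "reordered"
    return "different"
```

  Using a multiset rather than a plain set keeps duplicate rows significant, 
which matters for queries that do not use DISTINCT.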
  
  Comparison will use T351819 and be based on the sets of SPARQL queries 
extracted in T349512.
  
  We should attempt to identify the reasons for the differences and whether 
they are related or unrelated to the split:
  
  - query features dependent on the internal ordering of the Blazegraph B-trees 
(LIMIT X OFFSET Y, bd:slice)
  - use of external datasets (federation, mwapi)
  - unicode collation issues (T233204)
  - ...add more when discovered
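
  Some of the reasons listed above can be spotted from the query text alone. 
The following is a rough sketch with hypothetical feature names and 
deliberately simple patterns: it only detects federation when the endpoint is 
written as a full IRI, and it does not attempt to detect collation issues.

```python
import re

# Hypothetical feature names mapped to simple, case-insensitive patterns.
FEATURE_PATTERNS = {
    "limit_offset": re.compile(
        r"\bLIMIT\s+\d+\s+OFFSET\s+\d+|\bOFFSET\s+\d+\s+LIMIT\s+\d+",
        re.IGNORECASE,
    ),
    "bd_slice": re.compile(r"\bbd:slice\b", re.IGNORECASE),
    "federation": re.compile(r"\bSERVICE\s+(?:SILENT\s+)?<https?://", re.IGNORECASE),
    "mwapi": re.compile(r"\bwikibase:mwapi\b", re.IGNORECASE),
}


def tag_features(query):
    """Return the sorted names of the features found in the query text."""
    return sorted(
        name for name, pattern in FEATURE_PATTERNS.items() if pattern.search(query)
    )
```

  Queries tagged this way could be set aside when counting differences, since 
their results may vary for reasons unrelated to the split.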
  
  For the queries whose results vary because of the split we should attempt to 
evaluate if targeting scholarly articles is intentional or not (e.g. 
statistical queries with group by counts) and possibly identify the tools and 
their maintainers to contact them to gather feedback on the project.
  
  AC:
  
  - a report is available showing how the current split is going to affect 
queries once run on the wikidata main subgraph
  - a list of affected tools/scripts (when identifiable) that could possibly be 
contacted
