[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-05-19 Thread WMDE-leszek
WMDE-leszek added a comment. I believe we are pausing this analysis for now, so I guess we could close this task to simplify your task bookkeeping, @GoranSMilovanovic? TASK DETAIL https://phabricator.wikimedia.org/T248308

2020-05-18 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. @WMDE-leszek @darthmon_wmde Do we need anything else here in the foreseeable future?

2020-04-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `Tue 28 Apr 2020 02:17:33 AM UTC`: Here is the updated report on SPARQL feature selection via XGBoost: F31783672: WDQS Endpoint Analytics_20200427_B.nb.html - The model performance was improv

2020-04-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `Mon 27 Apr 2020 10:31:05 PM UTC`: **The most frequently observed SPARQL queries dataset** - Selection criteria: the query was observed >= 50 times in the WDQS endpoint sample (approx. `1M` queries, `2020/04/01` - `2020/04/21`). - For each
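The selection criterion above (a query observed at least 50 times in the sample) can be illustrated with a minimal Python sketch. This is not the code used in the analysis: the function name `frequent_queries`, the whitespace trimming, and the toy sample are assumptions made for illustration only.

```python
from collections import Counter


def frequent_queries(queries, min_count=50):
    """Return {query: count} for queries observed at least min_count times.

    Assumption: two queries count as "the same" when their text is identical
    after trimming surrounding whitespace. The real analysis may normalise
    queries differently.
    """
    counts = Counter(q.strip() for q in queries)
    return {q: n for q, n in counts.items() if n >= min_count}


# Toy sample (hypothetical): one query seen 60 times, another seen 3 times.
sample = ["SELECT ?a WHERE { ?a ?b ?c }"] * 60 + ["ASK { ?x ?y ?z }"] * 3
print(frequent_queries(sample))
```

Only the first toy query passes the threshold; the second is dropped because it appears fewer than 50 times.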

2020-04-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `Mon 27 Apr 2020 10:10:23 PM UTC`: **Final reports** - Here is **Part A** of the Final Report, which covers the Exploratory Data Analysis (EDA) only: (1) the characteristics of the sample of SPARQL queries used in thi

2020-04-26 Thread darthmon_wmde
darthmon_wmde added a comment. Thanks a lot, @GoranSMilovanovic, for all your hard work! > @darthmon_wmde @WMDE-leszek At this point, I would limit any additional work here to what is needed to consolidate the reports, brush up a detail or two, and wrap up what we have now in a c

2020-04-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. @Addshore - `queries_vocabulary.csv` - all features extracted from approx. `1M` SPARQL queries, 1-21 April 2020; statistic: total feature frequency (including multiple occurrences of the same feature in a query); - `queries_coverage.csv` - all feat
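The "total feature frequency" statistic described for `queries_vocabulary.csv` counts every occurrence of a feature, including repeats within one query. The sketch below shows that counting rule in Python. The whitespace `tokenize` function is a hypothetical stand-in for the real feature extractor, which is not shown in this thread. The second function, `query_count`, is included only as a contrast (each feature counted at most once per query); the description of `queries_coverage.csv` is cut off in the source, so no claim is made that it uses this statistic.

```python
from collections import Counter


def tokenize(query):
    # Hypothetical feature extractor: splits on whitespace only.
    return query.split()


def total_frequency(queries):
    """Count every occurrence of each feature, repeats within a query included."""
    counts = Counter()
    for q in queries:
        counts.update(tokenize(q))
    return counts


def query_count(queries):
    """Contrast statistic (assumption): number of queries containing each feature."""
    counts = Counter()
    for q in queries:
        counts.update(set(tokenize(q)))
    return counts


queries = ["SELECT ?x WHERE { ?x ?p ?x }", "ASK { ?x ?p ?o }"]
print(total_frequency(queries)["?x"])  # occurrences across all queries
print(query_count(queries)["?x"])      # queries that contain the feature
```

In the toy input, `?x` occurs three times in the first query and once in the second, so the two statistics diverge for that feature.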

2020-04-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `Fri 24 Apr 2020 04:01:17 AM CEST`, with respect to T248308#6062005: - A new sample of approximately `1M` SPARQL queries was drawn from the new events schema

2020-04-20 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. @Gehel First of all, thank you for all the insights that you have brought into the discussion thus far. > There is probably better / more useful information published as part of the new events

2020-04-16 Thread Gehel
Gehel added a comment. A few additional notes: - There is probably better / more useful information published as part of the new events published directly from WDQS. Chec

2020-04-16 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `Thu Apr 16 10:21:32 UTC 2020`: - following the meeting with thephp.cc yesterday: - The modelling approach will change from more predictive to more explanatory, i.e. the variables that could not be used for prediction (`cache_status`, for exa

2020-04-15 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `wed, 15. apr 2020. 09:56:39 CEST` - First report on modelling results, to be discussed in a meeting `10:00 CEST` today. F31757331: WDQS Endpoint Analytics_20200414.nb.html

2020-04-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Update `Thu 09 Apr 2020 10:19:24 PM UTC`: - XGBoost with `gbtree` on a binary classification problem ("typical" vs. "extreme outlier" server response times): cross-validation started on **stat1005**; - using 9 data sets with varying number of features (<

2020-04-06 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. - Update `Mon 06 Apr 2020 04:54:47 PM CEST`: modeling extreme outliers in server response time (based on `time_firstbyte` from `wmf.webrequest`): **95%** accuracy on both the training and the held-out test data sets. - Note: consider re-formulating the problem as a mu

2020-04-06 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Current status: - pilot/research experiments completed: - research phase: - model server response times from the features extracted as atomic elements of the SPARQL queries in the sample; - experimented with various feature selections (size o

2020-03-31 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. Current status: - parsed SPARQL; initial, approximate feature engineering phase completed; - NEXT: validating the features by optimizing server response time (`time_firstbyte` from `wmf.webrequest`) by - XGBoost with cross-validation; - goal: selec

2020-03-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. `Fri 27 Mar 2020 11:16:16 AM UTC` - incorrect HiveQL sampling fixed (the first sample encompassed only queries from the first hour of each day from `2020-03-01` to `2020-03-20`); - the new sample, encompassing approximately 1% of all queries from `202

2020-03-26 Thread GoranSMilovanovic
GoranSMilovanovic added a comment. `Thu 26 Mar 2020 11:35:59 PM UTC` - a sample of SPARQL queries from `wmf.webrequest` was obtained by randomly sampling 1% of all queries that were sent to WDQS on each day from `2020-03-01` to `2020-03-20`; - the sample is now cleaned by removing
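The per-day 1% sampling described above was done in HiveQL against `wmf.webrequest`. As a rough illustration of the idea only, here is a Python sketch that draws a fixed fraction from each day's queries. The record layout `(day, query)`, the function name `sample_per_day`, and the fixed seed are all assumptions, not part of the original pipeline.

```python
import random
from collections import defaultdict


def sample_per_day(records, fraction=0.01, seed=0):
    """Draw roughly `fraction` of the queries from each day, uniformly at random.

    `records` is an iterable of (day, query) pairs (hypothetical layout).
    Sampling within each day separately keeps every day represented.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_day = defaultdict(list)
    for day, query in records:
        by_day[day].append(query)

    sampled = {}
    for day, queries in by_day.items():
        # Keep at least one query per day, even for very small days.
        k = max(1, round(len(queries) * fraction))
        sampled[day] = rng.sample(queries, k)
    return sampled


# Toy input (hypothetical): two days with 1000 queries each.
records = [(day, f"query-{day}-{i}")
           for day in ("2020-03-01", "2020-03-02")
           for i in range(1000)]
result = sample_per_day(records)
print({day: len(qs) for day, qs in result.items()})
```

Drawing from all of a day's queries, rather than from a fixed time window within the day, is what keeps the sample uniform over the day.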