[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-05-19 Thread WMDE-leszek
WMDE-leszek added a comment.


  I believe we are pausing this analysis for now, so I guess we could close this 
task to simplify your task bookkeeping, @GoranSMilovanovic?

TASK DETAIL
  https://phabricator.wikimedia.org/T248308

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic, WMDE-leszek
Cc: Samantha_Alipio_WMDE, MGerlach, JAllemandou, Lucas_Werkmeister_WMDE, 
Simon_Villeneuve, dcausse, Jakob_WMDE, Gehel, Addshore, Lydia_Pintscher, 
WMDE-leszek, Aklapper, darthmon_wmde, CBogen, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-05-18 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @WMDE-leszek @darthmon_wmde Do we need anything else here in the foreseeable 
future?



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `Tue 28 Apr 2020 02:17:33 AM UTC`
  
  Here goes the update report on SPARQL feature selection via XGBoost:
  
  F31783672: WDQS Endpoint Analytics_20200427_B.nb.html 

  
  - The model performance was improved mainly by (a) improving the feature 
engineering process (currently: not great, not terrible), and (b) controlling 
for a highly imbalanced design (i.e. the number of queries with "typical" 
processing times heavily outnumbers the number of queries with "extreme" 
processing times in the sample) by switching from XGBoost control parameters 
(like `scale_pos_weight`) to a manually implemented downsampling strategy;
  
  - I have switched from a definition of "extremely long processing time" as an 
extreme outlier to a definition which takes it to be a *mild outlier*: it poses 
a more difficult binary classification problem but still we get significant 
improvements (spot the difference between the Hit and False Alarm rate):
  
  - model **accuracy** is around 85%;
  - **Hit rate** (or True Positive Rate) is around 72%, and
  - **False alarm rate** (or False Positive Rate) is about 13%.
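
  A minimal sketch of the two ideas mentioned above, downsampling the majority class and reading accuracy, Hit rate and False alarm rate off the confusion matrix. This is not the code used in the report; the `is_slow` label key, the function names and the 1:1 class ratio are assumptions made for illustration only.

```python
import random


def downsample(rows, label_key="is_slow", seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    `rows` is a list of dicts; `label_key` (hypothetical name) holds a
    truthy value for queries with an unusually long processing time.
    """
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def classification_rates(y_true, y_pred):
    """Return accuracy, hit rate (TPR) and false alarm rate (FPR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "hit_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

  Note that downsampling would normally be applied to the training split only, so that the rates are still measured on data with the original class balance.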
  
  The list of critical SPARQL features (plus what has been extracted as a 
feature from `event.wdqs_external_sparql_query`) is found in *Section 4. 
Selected features*.
  
  Good night.



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `Mon 27 Apr 2020 10:31:05 PM UTC`:
  
  **The most frequently observed SPARQL queries dataset**
  
  - Selection criteria: the query was observed >= 50 times in the WDQS endpoint 
sample (approx. `1M` queries, `2020/04/01` - `2020/04/21`).
  - For each query we report the mean WDQS processing time, the median WDQS 
processing time, and the standard deviation of processing time;
  - the dataset is sorted in descending order of mean WDQS processing time;
  - the `Percent` column stands for the `%` of the total number of queries in 
the sample represented by the respective (repeatedly observed) query and does 
not sum up to `100%` because, again, we report only on the queries that were 
observed 50 or more times on the endpoint.
  
  Here goes the dataset:
  
  F31783527: repeatedQueries_Filter50.csv 

  
  Columns:
  
  - `uniqueSparqlId` - the unique ID of the query - never mind, I need it for 
some join operations on data frames;
  - `sparql` - the query itself
  - `Num_Observations` - how many times was the query observed in the sample;
  - `mean_query_time` - the mean WDQS processing time for this query
  - `median_query_time` - the median WDQS processing time for this query
  - `stdev_query_time` - the standard deviation of the  WDQS processing time 
for this query
  - `Percent` - explained above.
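
  For readers who want to rebuild a table of this shape from raw observations, here is a rough sketch. It assumes the input is a sequence of `(sparql, query_time)` pairs and uses the sample standard deviation; both are assumptions, and the function name is hypothetical. The `uniqueSparqlId` column is omitted.

```python
import statistics
from collections import defaultdict


def summarize_repeated_queries(observations, min_count=50):
    """Aggregate processing times per distinct query text.

    `observations` is an iterable of (sparql, query_time) pairs.
    Only queries observed at least `min_count` times are reported.
    """
    times = defaultdict(list)
    total = 0
    for sparql, query_time in observations:
        times[sparql].append(query_time)
        total += 1

    rows = []
    for sparql, ts in times.items():
        if len(ts) < min_count:
            continue
        rows.append({
            "sparql": sparql,
            "Num_Observations": len(ts),
            "mean_query_time": statistics.mean(ts),
            "median_query_time": statistics.median(ts),
            # Sample standard deviation; undefined for a single observation.
            "stdev_query_time": statistics.stdev(ts) if len(ts) > 1 else 0.0,
            # Share of ALL sampled queries, so the column need not sum to 100.
            "Percent": 100.0 * len(ts) / total,
        })

    # Slowest queries (by mean processing time) first.
    rows.sort(key=lambda r: r["mean_query_time"], reverse=True)
    return rows
```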



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `Mon 27 Apr 2020 10:10:23 PM UTC`:
  
  **Final reports**
  
  - Here is **Part A** of the Final Report, which covers the Exploratory Data 
Analysis (EDA) only: (1) the characteristics of 
the sample of SPARQL queries used in this study, (2) the overview of the number 
of queries run per (a) day of week, (b) hour of day, (c) WMF Datacenter/Host, 
(d) HTTP method of request, (e) server HTTP response code, and (f) the desired 
output format, (3) the mean and median WDQS query processing times across the 
mentioned (a) - (f) variables, and (4) the distributions of WDQS processing 
times across WMF Datacenter/Hosts and output format.
  
  F31783509: WDQS Endpoint Analytics_20200427_A.nb.html 

  
  **Summary**
  
  - The `eqiad` data center receives far more queries than `codfw`.
  - The `XML` output format seems to take much longer to process than 
`JSON` and `text/plain` (although we have only a few observations of 
`text/plain` in the sample).
  - The distributions of the WDQS processing time across the crucial variables 
(WMF Datacenter/Host, Output format) are highly skewed towards short processing 
times, so we really need to focus on the outliers (as was already done in 
the ML approach).
  
  **Next:**
  
  - share the dataset of most frequently observed SPARQL queries at the WDQS 
endpoint;
  - share Part B of the Report: optimizing the WDQS processing times w. XGBoost 
and features parsed from SPARQL (nothing new, all covered in our meetings with 
thephp.cc, just a wrap-up).



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-26 Thread darthmon_wmde
darthmon_wmde added a comment.


  Thanks a lot, @GoranSMilovanovic for all your hard work!
  
  > @darthmon_wmde @WMDE-leszek At this point, I would put any additional work 
here just as much as it is needed to consolidate the reports, brush-up a detail 
or two, and wrap-up what we have now in a concise manner. Before taking any 
further steps I would suggest that it is better to have some decision on this 
approach: a go (we will use it to optimize WDQS in the future somehow), or a 
no go (we've learned something, but the optimization will rely on a different 
background). Take into consideration that experiments like the ones that I 
have been conducting in the previous days are costly in terms of both time and 
resources (computational time and memory on our stat100* servers: I've been 
using stat1005 + my own server for computations to avoid overloading our 
infrastructure and save some time). If the decision is a go - we need to 
discuss the real-world implementation (and that will almost certainly mean 
Python/Pyspark development). Thanks.
  
  I would say that, for now, we should wait a bit and first decide how we 
tackle the suggestions coming from thePHP.cc's report, in order to prioritise 
the steps and, as you nicely put it, "discuss the real-world implementation".



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Addshore
  
  - `queries_vocabulary.csv` - all features extracted from approx. `1M` SPARQL 
queries, 1 - 21. April 2020; statistic: total feature frequency (including 
multiple occurrences of the same feature in a query);
  - `queries_coverage.csv` - all features extracted from approx. `1M` SPARQL 
queries, 1 - 21. April 2020; statistics: 1. number of unique queries in the 
sample that made use of a particular feature; 2. % of unique queries in the 
sample that made use of the respective feature.
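
  The difference between the two statistics can be illustrated with a small sketch. The tokenizer below is a deliberately crude, hypothetical stand-in for the actual feature engineering (which is not described in detail here), and computing the vocabulary over all sampled queries rather than only the unique ones is an assumption.

```python
import re
from collections import Counter

# Hypothetical tokenizer: variables, prefixed names, IRIs, keywords, punctuation.
TOKEN = re.compile(r"[?$]\w+|\w+:\w*|<[^>\s]*>|\w+|[{}().;,*]")


def tokenize(sparql):
    """Split a SPARQL string into rough 'word-like' features."""
    return TOKEN.findall(sparql)


def vocabulary_and_coverage(queries):
    """Return (vocabulary, coverage, coverage_pct).

    vocabulary   - total frequency of each feature, counting repeated
                   occurrences inside the same query;
    coverage     - number of unique queries that use the feature at all;
    coverage_pct - the same, as a percentage of unique queries.
    """
    vocabulary = Counter()
    for q in queries:
        vocabulary.update(tokenize(q))

    unique_queries = set(queries)
    coverage = Counter()
    for q in unique_queries:
        # set(): each feature counts at most once per unique query.
        coverage.update(set(tokenize(q)))

    n = len(unique_queries)
    coverage_pct = {f: 100.0 * c / n for f, c in coverage.items()}
    return vocabulary, coverage, coverage_pct
```

  A coverage threshold (for example, keeping only features used by at least `1%` of unique queries) could then be applied to `coverage_pct` to control the size of the feature set.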
  
  Google Drive: datasets 




[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `Fri 24 Apr 2020 04:01:17 AM CEST`, with respect to T248308#6062005:
  
  - A new sample of approximately `1M` SPARQL queries was drawn from the new 
events schema, `event.wdqs_external_sparql_query`, suggested by @Gehel; the most 
important improvement should be the presence of the `query_time` field, which is 
a more precise measure of the WDQS processing time than the `time_firstbyte` 
variable from the previously used `wmf.webrequest` schema
  
  - Sampling: approx. 1% of all queries per day, `2020/04/01` - `2020/04/21`
  
  - Besides the SPARQL queries themselves, the following variables were 
collected from the `event.wdqs_external_sparql_query` table:
    - `dt` - timestamp
    - `format` - JSON, XML, etc. - the desired output format
    - `http.method` - GET, POST, etc.
    - `http.status_code` - HTTP status code, not used in the analysis (but it 
could be used as a criterion);
    - `backend_host` - backend host
    - `datacenter` - data center
    - `query_time` - WDQS query processing time;
    - unfortunately, the `event.wdqs_external_sparql_query` table does not record 
cache status - we had initially planned to use this feature in the analysis 
(see: T248308#6062005)
  
  - **Goal:** generate a list of the most critical features that influence 
query processing time from what can be extracted from the available SPARQL 
queries
  
  - Feature engineering procedures were slightly improved: still analyzing 
SPARQL as if it was a natural language + new procedures are less error-prone 
than the ones previously used
  
  - Modelling approach: same as before,
    - split `query_time` to derive a binary criterion: "typical processing 
time" vs "extreme outlier processing time",
    - optimize w. XGBoost, `GBTree` booster - sequential decision trees w. 
automatic feature selection,
    - find out what are the most important features that can help sort out the 
two classes of queries by length of processing time,
    - run a linear model to determine which of the selected features influence 
the processing times positively (making them longer - slower processing) or 
negatively (making them shorter - faster processing).
  
  - **RESULTS:**
    - **None** of the variables obtained from the schema (as opposed to those 
derived from SPARQL feature engineering) is selected as important for the 
"typical processing time" vs "extreme outlier processing time" dichotomy 
(except: hour of the day, derived from the `dt` timestamp);
    - Essentially there are **no differences in comparison to the results 
reported in our April 15 meeting**: an accuracy of approx. 92% can be achieved 
with a True Positive Rate (TPR) of approx. 60% and a very low False Positive 
Rate (FPR) + the model can be improved by selecting a proper decision threshold 
as exemplified in the report in T248308#6057950;
    - Running a linear model with the selected features against `query_time` in 
seconds as a criterion enables us to see what features influence the WDQS 
processing time positively or negatively: the results are summarized in the 
following table:
  
  F31777475: importanceReg_300.csv 
  
  - The columns in the table are the following: `feature` - a feature 
extracted from SPARQL, `weigth` - regression coefficient. Because each query is 
characterized by a count of uses of each feature it mentions, 
the coefficients mean the following: each additional use of a particular 
feature in a query changes its processing time by the value of the 
coefficient. For example, row number 96, feature: `nchar`, which represents the 
length of the query in characters, tells us that each additional character in the 
query contributes an increase of `0.0867218` seconds in query processing 
time. Another example, row number 68, feature: `__vars__`, which represents 
the number of unique variables instantiated in a query, tells us that each 
additional variable increases the query processing time by 
`5.15947` seconds. The only features for which this interpretation does not 
hold are those of the form `f_ds_hour_5` - this particular one 
represents the fact that the query was run between 5 and 6 in the morning UTC. 
We also find `f_ds_hour_1`, `f_ds_hour_22`, and similar in the table. The value 
of the coefficient for these variables tells us how many additional seconds of 
query processing time are incurred *relative to the first hour of the day* (i.e. 
`f_ds_hour_0`, which serves as the baseline and thus is not found in the table). 
Be aware of the fact that the linear model **is very imprecise in this case**; 
I would interpret the coeff
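
  As a worked illustration of the coefficient reading described above, the sketch below multiplies each coefficient by the change in the corresponding feature count and sums the results. Only the two coefficients quoted in the text are used; the function name is hypothetical, and, given the stated imprecision of the linear model, the output is a rough indication at best.

```python
# Coefficients quoted in the comment above (seconds per additional unit).
COEFFICIENTS = {
    "nchar": 0.0867218,    # per additional character in the query
    "__vars__": 5.15947,   # per additional unique variable
}


def predicted_change(feature_deltas, coefficients=COEFFICIENTS):
    """Estimated change in query_time (seconds) under the linear model.

    `feature_deltas` maps a feature name to the change in its count.
    For hour-of-day dummies such as `f_ds_hour_5` the delta is 1 when
    the query ran in that hour and 0 otherwise, and the coefficient is
    read relative to the baseline hour `f_ds_hour_0`.
    """
    return sum(coefficients[name] * delta
               for name, delta in feature_deltas.items())


# 100 more characters and 2 more variables: roughly 19 extra seconds.
extra_seconds = predicted_change({"nchar": 100, "__vars__": 2})
```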

[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-20 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Gehel First of all, thank you for all the insights that you have brought 
into the discussion thus far.
  
  > There is probably better / more useful information published as part of the 
new events published directly from WDQS.
  
  Indeed, the schema that you have referred to seems to be more suitable to work 
with than `wmf.webrequest`. Could you, @dcausse, or @JAllemandou please 
help me clarify the following:
  
  - What is the difference between `event.wdqs_internal_sparql_query` and 
`event.wdqs_external_sparql_query` in the WMF Data Lake?
  - Could you confirm that the `query_time` field represents "complete query 
time" as described by @Gehel in T248308#6062499 (above)?
  
  Thanks.
  
  > it's fairly easy to add metrics to the events from WDQS, maybe we should 
add the current concurrency levels, and maybe the current CPU load. Open a 
ticket for us if you think that might be useful.
  
  Let's see what can be inferred from the variables present in the new schemata 
now. Then we can decide together if it would be beneficial for us to have any 
new metrics added there.



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-16 Thread Gehel
Gehel added a comment.


  A few additional notes:
  
  - There is probably better / more useful information published as part of the 
new events published directly from WDQS. Check directly with @dcausse or 
@JAllemandou if needed.
  - complete query time is probably a better predictor of resource usage than 
TTFB (results are streamed, computation still happens after first byte).
  - it's fairly easy to add metrics to the events from WDQS, maybe we should 
add the current concurrency levels, and maybe the current CPU load. Open a 
ticket for us if you think that might be useful.



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-16 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `Thu Apr 16 10:21:32 UTC 2020`:
  
  - following the meeting with thephp.cc yesterday:
    - The modelling approach will change from more predictive to more 
explanatory, i.e. the variables that could not be used for prediction 
(`cache_status`, for example) will be encompassed in order to study the WDQS 
responses in more detail;
    - A model to estimate the effect of the number of concurrently running 
queries on the server response time will be developed;
    - All planned improvements of the currently used family of XGBoost models 
(feature engineering fixes in the first place) will be implemented;
    - Next meeting to be scheduled for sometime near the end of the following 
week.



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-15 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `wed, 15. apr 2020.  09:56:39 CEST`
  
  - First report on modelling results, to be discussed in a meeting `10:00 
CEST` today.
  
  F31757331: WDQS Endpoint Analytics_20200414.nb.html 




[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Update `Thu 09 Apr 2020 10:19:24 PM UTC`:
  
  - XGBoost w. `gbtree` on a binary classification problem ("typical" vs. 
"extreme outlier" server response times) cross-validation started on 
**stat1005**;
  - using 9 data sets with varying number of features (<100 - 2000);
  - splitting test from train data for each data set;
  - running `xgboost` internal cross-validation controls;
  - cross-validating across: learning rate (`eta`, 4 levels), subsample (rows, 
4 levels) parameter to build trees, `max_depth` (how deep trees are allowed, 4 
levels);
  - number of iterations set to monotonically decrease with `eta`;
  - keeping `colsample_bytree` (proportion of features used to build each tree) 
fixed at .5;
  - setting `max_delta_step` to 1 - documented to be useful for highly 
unbalanced designs in binary classification (as ours is);
  - model selection: ROC Analysis -> AUC.
  
  Resource consumption: 32 cores, approx. 15Gb RAM.
  Approximate running time guesstimate: 24 - 30h.
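
  The shape of the cross-validation grid described above can be sketched as follows. The parameter names are XGBoost's documented ones, but every concrete level value and the rule tying the number of boosting rounds to `eta` are placeholders chosen for illustration; the levels actually used are not stated in this update.

```python
from itertools import product

# Hypothetical levels; only the number of levels (4 each) comes from the text.
ETAS = [0.01, 0.05, 0.1, 0.3]
SUBSAMPLES = [0.25, 0.5, 0.75, 1.0]
MAX_DEPTHS = [3, 6, 9, 12]


def rounds_for(eta, budget=10.0, min_rounds=50):
    """Assumed rule: fewer boosting rounds as the learning rate grows."""
    return max(min_rounds, int(round(budget / eta)))


def build_grid():
    """Return a list of (params, num_boost_round) pairs, one per grid cell."""
    grid = []
    for eta, subsample, max_depth in product(ETAS, SUBSAMPLES, MAX_DEPTHS):
        params = {
            "booster": "gbtree",
            "objective": "binary:logistic",  # "typical" vs "extreme outlier"
            "eval_metric": "auc",            # model selection via ROC AUC
            "eta": eta,
            "subsample": subsample,
            "max_depth": max_depth,
            "colsample_bytree": 0.5,         # fixed, as stated above
            "max_delta_step": 1,             # for the unbalanced design
        }
        grid.append((params, rounds_for(eta)))
    return grid
```

  With 4 x 4 x 4 levels this yields 64 parameter combinations per data set. Each pair could then be handed to a cross-validation routine such as `xgboost.cv` (not called here, to keep the sketch dependency-free).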



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-06 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  - Update `Mon 06 Apr 2020 04:54:47 PM CEST`: modeling extreme outliers on 
server response time (based on `time_firstbyte` from `wmf.webrequest`): **95%** 
accuracy on both train and held out test data set.
  - Note: consider re-formulating the problem as a multi-class labeling + 
softmax function optimization.



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-04-06 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Current status:
  
  - pilot/research experiments completed:
    - research phase:
      - model server response times from the features extracted as atomic 
elements of the SPARQL queries in the sample;
      - experimented with various feature selections (size of the feature 
vocabulary);
      - model: XGBoost for regression, RMSE optimization;
      - results: everything between approx. R = .72 (test data set) and R = .91 
(train data set) can be achieved;
  
  - first serious model:
    - goal: categorize unusually long server response times (> upper inner 
fence, Q3 + 1.5*IQR - "mild outliers");
    - method: XGBoost optimization of logistic loss (i.e. say Binomial 
Regression from an ensemble of Decision Trees);
    - result: accuracy **92%** on both train and test data (approx. 50% split 
of 1M queries in the sample).
  
  NEXT steps:
  
  - running full CV cycles across learning rate, tree depth, taking best 
iterations in n-fold CVs only;
  - singling out the most reliable model;
  - attempt to predict extreme outliers (> upper outer fence, Q3 + 3*IQR - 
"extreme outliers");
  - reporting by Wednesday, 2020/04/09;
  - clustering queries on the features most important for predicting server 
response time (if necessary - to discuss with the team).
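
  The CV cycle over learning rate and tree depth could be organized roughly as below. This is a sketch only: `fit_and_score` is a hypothetical callback standing in for the actual model fit (with XGBoost, `xgboost.cv` with `early_stopping_rounds` would play that role), and the dummy scorer exists just to make the example run.

```python
from itertools import product

def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs for a simple unshuffled n-fold split."""
    folds = [list(range(k, n_rows, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

def grid_search(n_rows, n_folds, learning_rates, depths, fit_and_score):
    """Return (best_mean_score, params); a higher score is better."""
    best = None
    for eta, depth in product(learning_rates, depths):
        scores = [fit_and_score(train, test, eta, depth)
                  for train, test in kfold_indices(n_rows, n_folds)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, {"eta": eta, "max_depth": depth})
    return best

def dummy_score(train, test, eta, depth):
    """Dummy scorer (assumption): pretends eta=0.1, depth=6 is the optimum."""
    return -abs(eta - 0.1) - abs(depth - 6)

best_score, best_params = grid_search(10, 5, [0.01, 0.1, 0.3], [4, 6, 8], dummy_score)
```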



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-03-31 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Current status:
  
  - parsed SPARQL; initial, approximate feature engineering phase completed;
  - NEXT: validating the features by modeling server response time 
(`time_firstbyte` from `wmf.webrequest`):
    - method: XGBoost with cross-validation;
    - goal: select the number of features, varying by `%` of queries that used a 
particular feature, starting from `1%`;
  - experiments ongoing on `stat1005`.
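
  A minimal sketch of the feature-vocabulary selection described above: keep a feature only if at least a given share of queries uses it. The function name, the toy feature lists, and the thresholds are illustrative assumptions.

```python
from collections import Counter

def select_features(query_features, min_share=0.01):
    """Keep features used by at least `min_share` of the queries."""
    n = len(query_features)
    # Count each feature at most once per query.
    usage = Counter(f for feats in query_features for f in set(feats))
    return sorted(f for f, count in usage.items() if count / n >= min_share)

# Toy per-query feature lists (made up for illustration).
queries = [
    ["SELECT", "OPTIONAL"],
    ["SELECT", "FILTER"],
    ["SELECT"],
    ["SELECT", "wdt:P31"],
]
vocab_25 = select_features(queries, min_share=0.25)
vocab_50 = select_features(queries, min_share=0.50)
```

  Raising `min_share` shrinks the vocabulary, which is the knob varied in the experiments.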



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-03-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  `Fri 27 Mar 2020 11:16:16 AM UTC`
  
  - incorrect HiveQL sampling fixed (the first sample encompassed only queries 
from the first hour of each day from `2020-03-01` to `2020-03-20`);
  - the new sample, approximately 1% of all queries from `2020-03-01` to 
`2020-03-20`, contains `1,095,712` queries, of which `815,542` are unique.
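
  For illustration, the intended sampling (each request kept with probability 1%, regardless of hour) can be sketched as below. The records are synthetic and the function is a hypothetical stand-in for the HiveQL query, not the query itself.

```python
import random

def sample_requests(records, rate=0.01, seed=42):
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]

# Synthetic rows: (day, hour, query text); 20 days x 24 hours x 100 requests.
records = [(day, hour, f"q{i % 50}")
           for day in range(1, 21) for hour in range(24) for i in range(100)]

sample = sample_requests(records)
total = len(sample)
unique = len({query for _, _, query in sample})
hours_covered = {hour for _, hour, _ in sample}
```

  A correct sample covers all hours of the day; the first, faulty sample covered only the first hour.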



[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

2020-03-26 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  `Thu 26 Mar 2020 11:35:59 PM UTC`
  
  - a sample of SPARQL queries from `wmf.webrequest` was obtained by randomly 
sampling 1% of all queries sent to WDQS on each day from 
`2020-03-01` to `2020-03-20`;
  - the sample is now cleaned by removing all `http_status == 4**` (client-side 
errors; I guess we are not interested in malformed queries), checked for 
consistency, and `URLdecoded()` so that the `uri_query` field holds only SPARQL 
code;
  - next steps:
    - exploratory data analysis;
    - feature engineering: describing queries from their SPARQL language 
constituents.
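
  The cleaning step can be sketched as follows. This assumes the SPARQL text arrives in a `query` URL parameter and that rows are plain dicts; both are assumptions for illustration, not the actual pipeline.

```python
from urllib.parse import parse_qs

def clean_sample(rows):
    """Drop 4xx responses and decode the SPARQL text from `uri_query`."""
    cleaned = []
    for row in rows:
        if 400 <= int(row["http_status"]) < 500:
            continue  # client-side error, e.g. a malformed query
        params = parse_qs(row["uri_query"].lstrip("?"))
        sparql = params.get("query", [None])[0]  # `query` parameter is assumed
        if sparql:
            cleaned.append({**row, "sparql": sparql})
    return cleaned

# Toy rows (made up for illustration).
rows = [
    {"http_status": "200",
     "uri_query": "?query=SELECT%20%3Fitem%20WHERE%20%7B%7D&format=json"},
    {"http_status": "400", "uri_query": "?query=SELEC"},
]
cleaned = clean_sample(rows)
```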
