GoranSMilovanovic added a comment.

  Current status:
  
  - pilot/research experiments completed:
    - research phase:
      - modeled server response times from features extracted as atomic
        elements of the SPARQL queries in the sample;
      - experimented with various feature selections (size of the feature
        vocabulary);
      - model: XGBoost for regression, RMSE optimization;
      - results: anything from approx. R = .72 (test data set) to R = .91
        (train data set) can be achieved;
  
  - first serious model:
    - goal: categorize unusually long server response times (> upper inner
      fence, Q3 + 1.5*IQR - "mild outliers");
    - method: XGBoost optimization of logistic loss (i.e. binomial
      regression from an ensemble of decision trees);
    - result: accuracy **92%** on both train and test data (approx. 50% split
      of 1M queries in the sample).
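  
  A minimal sketch of how the "unusually long" label can be derived from the
  fence rule above. The function names, the sample values, and the choice of
  quantile method are assumptions made for illustration, not the code used in
  the experiments; the same helper also returns the Q3 + 3*IQR threshold.

```python
import statistics


def upper_fences(times):
    """Return the (Q3 + 1.5*IQR, Q3 + 3*IQR) upper fences for response times."""
    # Assumption: "inclusive" quartiles; other quantile methods shift the fences slightly.
    q1, _median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr


def label_outliers(times, fence):
    """Binary target: 1 if the response time exceeds the fence, else 0."""
    return [1 if t > fence else 0 for t in times]


# Made-up response times in milliseconds, for illustration only.
sample_ms = [10, 12, 14, 16, 18, 20, 22, 24, 50, 200]

inner, outer = upper_fences(sample_ms)
mild_labels = label_outliers(sample_ms, inner)     # target for the "mild outliers" model
extreme_labels = label_outliers(sample_ms, outer)  # target for an "extreme outliers" model

print(inner, outer)
print(mild_labels)
print(extreme_labels)
```

  The resulting 0/1 vector is the kind of target a logistic-loss classifier
  would be trained on.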
  
  NEXT steps:
  
  - running full CV cycles across learning rate, tree depth, taking best 
iterations in n-fold CVs only;
  - singling out the most reliable model;
  - attempt to predict extreme outliers (> upper outer fence, Q3 + 3*IQR - 
"extreme outliers");
  - reporting by Wednesday, 2020/04/09;
  - clustering queries from the most important features in server response time 
optimization (if necessary - to discuss with the team).
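
  The planned CV cycle can be sketched roughly as follows. This is only an
  outline under stated assumptions: `cv_curve` is a hypothetical callable that
  returns the mean held-out loss after each boosting round for one n-fold CV
  run (in practice it could wrap a routine such as `xgboost.cv`), and the loss
  curves below are invented numbers, not results.

```python
from itertools import product


def select_best(cv_curve, learning_rates, max_depths):
    """Grid over (learning rate, tree depth), keeping the best round of each CV run."""
    best = None
    for eta, depth in product(learning_rates, max_depths):
        curve = cv_curve(eta, depth)  # mean held-out loss per boosting round
        round_idx = min(range(len(curve)), key=curve.__getitem__)
        candidate = (curve[round_idx], eta, depth, round_idx + 1)
        # Keep the configuration with the lowest held-out loss seen so far.
        if best is None or candidate[0] < best[0]:
            best = candidate
    loss, eta, depth, n_rounds = best
    return {"loss": loss, "eta": eta, "max_depth": depth, "n_rounds": n_rounds}


# Invented loss curves standing in for real n-fold CV output.
TOY_CURVES = {
    (0.1, 3): [0.60, 0.50, 0.45, 0.47],
    (0.1, 6): [0.58, 0.44, 0.46, 0.49],
    (0.3, 3): [0.55, 0.48, 0.50, 0.52],
    (0.3, 6): [0.52, 0.47, 0.51, 0.56],
}


def toy_cv_curve(eta, depth):
    """Hypothetical stand-in for a real cross-validation run."""
    return TOY_CURVES[(eta, depth)]


result = select_best(toy_cv_curve, [0.1, 0.3], [3, 6])
print(result)
```

  Selecting on the held-out curve only, never on the training curve, is what
  "taking best iterations in n-fold CVs only" amounts to in this sketch.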

TASK DETAIL
  https://phabricator.wikimedia.org/T248308

To: GoranSMilovanovic
Cc: JAllemandou, Lucas_Werkmeister_WMDE, Simon_Villeneuve, dcausse, Jakob_WMDE, 
Gehel, Addshore, Lydia_Pintscher, WMDE-leszek, Aklapper, darthmon_wmde, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331