GoranSMilovanovic added a comment.
Update `Tue 28 Apr 2020 02:17:33 AM UTC`

Here is the update report on SPARQL feature selection via XGBoost:

F31783672: WDQS Endpoint Analytics_20200427_B.nb.html <https://phabricator.wikimedia.org/F31783672>

- Model performance was improved mainly by (a) improving the feature engineering process (currently: not great, not terrible), and (b) controlling for a highly imbalanced design (i.e. queries with "typical" processing times heavily outnumber queries with "extreme" processing times in the sample) by switching from XGBoost control parameters (like `scale_pos_weight`) to a manually implemented downsampling strategy;
- I have switched from defining an "extremely long processing time" as an extreme outlier to defining it as a *mild outlier*: this poses a more difficult binary classification problem, but we still get significant improvements (spot the difference between the Hit and False Alarm rates):
  - model **accuracy** is around 85%;
  - **Hit rate** (or True Positive Rate) is around 72%, and
  - **False alarm rate** (or False Positive Rate) is about 13%.

The list of critical SPARQL features (plus what has been extracted as a feature from `event.wdqs_external_sparql_query`) is found in *Section 4. Selected features*.

Good night.

TASK DETAIL
https://phabricator.wikimedia.org/T248308
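For readers who want the terms in the update above made concrete, here is a minimal Python sketch of class balancing by downsampling and of the three reported rates. The update does not say how the downsampling was implemented, so random undersampling of the majority class to a 1:1 ratio is an assumption, and the names `downsample`, `rates`, and the `extreme` label are hypothetical. The data is a toy example and does not reproduce the reported figures.

```python
import random


def downsample(rows, label_key="extreme", seed=42):
    """Randomly undersample the majority class to the size of the minority class.

    Assumption: a 1:1 class ratio; the update does not state the ratio used.
    """
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def rates(y_true, y_pred):
    """Return accuracy, hit rate (TPR) and false alarm rate (FPR) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "hit_rate": tp / (tp + fn) if (tp + fn) else float("nan"),
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
    }


# Toy sample: 50 "extreme" queries among 1000 (hypothetical numbers).
rows = [{"id": i, "extreme": 1 if i < 50 else 0} for i in range(1000)]
balanced = downsample(rows)

# Toy predictions: 3 hits, 1 miss, 1 false alarm, 5 correct rejections.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
result = rates(y_true, y_pred)
```

In a setup like this, downsampling would normally be applied to the training split only, with the rates computed on an untouched test split so that they reflect the original class imbalance.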