Ladsgroup moved this task from Doing to Peer Review on the Item Quality Scoring Improvement (Item Quality Scoring Improvement - Sprint 3) board. Ladsgroup added a comment.
(All metrics are micro averages <https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin>, not macro averages.)

Old data, new features:

| Metric | Without property suggester | With property suggester | Difference |
| --- | --- | --- | --- |
| False positive rate | 0.057 | 0.058 | 1.7% increase |
| Accuracy | 0.929 | 0.929 | No difference |
| ROC AUC | 0.972 | 0.973 | 0.1% increase |

New data, new features:

| Metric | Without property suggester | With property suggester | Difference |
| --- | --- | --- | --- |
| False positive rate | 0.115 | 0.116 | 0.87% increase (!) |
| Accuracy | 0.818 | 0.817 | 0.1% decrease (!) |
| ROC AUC | 0.864 | 0.866 | 0.2% increase |

All data combined, new features:

| Metric | Without property suggester | With property suggester | Difference |
| --- | --- | --- | --- |
| False positive rate | 0.052 | 0.052 | No difference |
| Accuracy | 0.930 | 0.931 | 0.1% increase |
| ROC AUC | 0.964 | 0.965 | 0.1% increase |

As you can see, there is not much to be gained from the item completeness metric. My hypothesis is that it used to be useful back when all of our features were broken, so let's try one more run with the old features only:

Old data, old features:

| Metric | Without property suggester | With property suggester | Difference |
| --- | --- | --- | --- |
| False positive rate | 0.066 | 0.066 | No difference |
| Accuracy | 0.922 | 0.923 | 0.1% increase |
| ROC AUC | 0.964 | 0.965 | 0.1% increase |

Nope, no actual change :/ Also, in the commit that introduced it, I can't find any improvement either: https://github.com/wikimedia/articlequality/commit/1d0feffdcecbbee6fa11903531edc5e4e91b41e3#diff-834e2f59d8597053582b57dc05d4c08e

While we are here, it is worth comparing the old features against the new ones:

Old data, without property suggester:

| Metric | Old features | New features | Difference |
| --- | --- | --- | --- |
| False positive rate | 0.066 | 0.057 | 14% decrease |
| Accuracy | 0.922 | 0.929 | 0.8% increase |
| ROC AUC | 0.964 | 0.972 | 0.8% increase |

That is a sharp improvement. A 1% gain in accuracy might not sound like much, but keep in mind we are in the long tail here. To express it better, flip the values for accuracy and ROC AUC: the new features give a 9% decrease in inaccuracy (for comparison, adding the property suggester with old data and old features gives a 1.3% decrease in inaccuracy, which is still negligible IMO). A quick sketch of both calculations follows below.
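To make the numbers above easy to reproduce, here is a minimal sketch in Python, assuming scikit-learn. This is not the actual articlequality evaluation code: the class list and the `micro_false_positive_rate`, `evaluate`, and `inaccuracy_reduction` helpers are hypothetical names written for illustration. The accuracies plugged in at the bottom are the ones from the tables above.

```
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

CLASSES = ["E", "D", "C", "B", "A"]  # hypothetical item-quality classes

def micro_false_positive_rate(y_true, y_pred, labels):
    """Micro-averaged FPR: pool false positives and true negatives
    across all classes, then take FP / (FP + TN)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp              # per-class false positives
    fn = cm.sum(axis=1) - tp              # per-class false negatives
    tn = cm.sum() - (fp + fn + tp)        # per-class true negatives
    return fp.sum() / (fp.sum() + tn.sum())

def evaluate(y_true, y_pred, y_score, labels=CLASSES):
    """Micro-averaged versions of the three metrics in the tables above."""
    y_true_bin = label_binarize(y_true, classes=labels)
    return {
        "false_positive_rate": micro_false_positive_rate(y_true, y_pred, labels),
        "accuracy": accuracy_score(y_true, y_pred),
        # micro-averaged ROC AUC over the binarized one-vs-rest problem
        "roc_auc": roc_auc_score(y_true_bin, y_score, average="micro"),
    }

def inaccuracy_reduction(acc_before, acc_after):
    """Relative decrease in inaccuracy (1 - accuracy) between two models:
    the 'flipped' comparison described above."""
    return ((1 - acc_before) - (1 - acc_after)) / (1 - acc_before)

# Old features -> new features on old data (accuracies from the last table):
print(f"{inaccuracy_reduction(0.922, 0.929):.1%}")  # 9.0% decrease in inaccuracy
# Adding the property suggester, old data and old features:
print(f"{inaccuracy_reduction(0.922, 0.923):.1%}")  # 1.3% decrease in inaccuracy
```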
TASK DETAIL
https://phabricator.wikimedia.org/T261850

WORKBOARD
https://phabricator.wikimedia.org/project/board/4952/