Ladsgroup moved this task from Doing to Peer Review on the Item Quality Scoring Improvement (Item Quality Scoring Improvement - Sprint 3) board.
Ladsgroup added a comment.


  (All metrics are micro averages, not macro: 
<https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin>)
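
  To make that concrete, here is a minimal sketch (toy labels, not the actual evaluation script) of the three metrics used in the tables below, and how micro vs. macro averaging differs for them, using scikit-learn:

```python
# Toy illustration of micro vs. macro averaging for a multiclass model.
# The labels and "scores" are made up; only the averaging logic matters here.
import numpy as np
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ["E", "D", "C", "B", "A"]  # item quality classes
y_true = ["E", "E", "D", "C", "C", "B", "A", "A", "E", "D"]
y_pred = ["E", "D", "D", "C", "B", "B", "A", "E", "E", "D"]

print("accuracy:", accuracy_score(y_true, y_pred))

# One one-vs-rest confusion matrix per class, shape (n_classes, 2, 2).
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=classes)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
print("micro FPR:", fp.sum() / (fp.sum() + tn.sum()))  # pool counts, then divide
print("macro FPR:", np.mean(fp / (fp + tn)))           # per-class rate, then mean

# ROC-AUC needs scores rather than hard labels; we fake per-class
# probabilities from the predictions just to show the averaging modes.
y_true_bin = label_binarize(y_true, classes=classes)
y_score = label_binarize(y_pred, classes=classes) * 0.9 + 0.05
print("micro roc_auc:", roc_auc_score(y_true_bin, y_score, average="micro"))
print("macro roc_auc:", roc_auc_score(y_true_bin, y_score, average="macro"))
```

  Micro averaging pools every (item, class) decision before computing the metric, so frequent classes dominate; macro averaging computes the metric per class and then takes an unweighted mean.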
  
  Old data, new features:
  
  | Metric              | Without Property suggester | With Property suggester | Difference    |
  | False positive rate | 0.057                      | 0.058                   | 1.7% increase |
  | Accuracy            | 0.929                      | 0.929                   | No difference |
  | roc_auc             | 0.972                      | 0.973                   | 0.1% increase |
  
  New data, new features:
  
  | Metric              | Without Property suggester | With Property suggester | Difference        |
  | False positive rate | 0.115                      | 0.116                   | 0.87% increase(!) |
  | Accuracy            | 0.818                      | 0.817                   | 0.1% decrease(!)  |
  | roc_auc             | 0.864                      | 0.866                   | 0.2% increase     |
  
  All data combined, new features:
  
  | Metric              | Without Property suggester | With Property suggester | Difference    |
  | False positive rate | 0.052                      | 0.052                   | No difference |
  | Accuracy            | 0.93                       | 0.931                   | 0.1% increase |
  | roc_auc             | 0.964                      | 0.965                   | 0.1% increase |
  
  As you can see, there's not much to be gained from the item completeness metric; my hypothesis is that it used to be useful when all of our features were broken. Let's try changing one thing only:
  
  Old data, old features:
  
  | Metric              | Without Property suggester | With Property suggester | Difference    |
  | False positive rate | 0.066                      | 0.066                   | No difference |
  | Accuracy            | 0.922                      | 0.923                   | 0.1% increase |
  | roc_auc             | 0.964                      | 0.965                   | 0.1% increase |
  
  Nope, no actual change :/ Also, in the commit that introduced it, I can't find any improvement either: 
https://github.com/wikimedia/articlequality/commit/1d0feffdcecbbee6fa11903531edc5e4e91b41e3#diff-834e2f59d8597053582b57dc05d4c08e
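
  For reference, this kind of single-feature comparison can be sketched generically as below. This is not the articlequality/revscoring training pipeline; the feature names and data are made-up placeholders, purely to illustrate training the same model with and without one feature:

```python
# Hypothetical with/without-one-feature comparison on synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "n_statements": rng.poisson(20, 1000),
    "n_sitelinks": rng.poisson(5, 1000),
    "property_suggester_completeness": rng.random(1000),  # placeholder feature
})
y = rng.integers(0, 5, 1000)  # item quality classes A-E encoded as 0-4

for label, columns in [
    ("without property suggester", ["n_statements", "n_sitelinks"]),
    ("with property suggester", list(X.columns)),
]:
    scores = cross_val_score(GradientBoostingClassifier(), X[columns], y,
                             cv=5, scoring="accuracy")
    print(f"{label}: accuracy = {scores.mean():.3f}")
```

  With random data like this the two runs score about the same; the point is only the shape of the comparison (same model, same data, one feature toggled).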
  
  While we are here, it's nice to compare the old and new features:
  Old data, without property suggester:
  
  | Metric              | Old features | New features | Difference    |
  | False positive rate | 0.066        | 0.057        | 14% decrease  |
  | Accuracy            | 0.922        | 0.929        | 0.8% increase |
  | roc_auc             | 0.964        | 0.972        | 0.8% increase |
  
  It shows a sharp increase in accuracy. 1% might not sound like much, but keep in mind that we are in the long tail. To put it more clearly, flip accuracy and ROC-AUC around and look at the error rate instead: that gives a 9% decrease in inaccuracy. (The same value for adding the property suggester with old data and old features is a 1.3% decrease in inaccuracy, which is still negligible IMO.)
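
  For clarity, the arithmetic behind that framing (relative change in the error rate, 1 - accuracy), using the numbers from the tables above:

```python
# Relative reduction in error rate (inaccuracy) between two accuracy values.
def relative_error_reduction(acc_before, acc_after):
    err_before, err_after = 1 - acc_before, 1 - acc_after
    return (err_before - err_after) / err_before

# Old features -> new features (old data, without property suggester):
print(f"{relative_error_reduction(0.922, 0.929):.1%} decrease in inaccuracy")  # ~9.0%
# Adding the property suggester (old data, old features):
print(f"{relative_error_reduction(0.922, 0.923):.1%} decrease in inaccuracy")  # ~1.3%
```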

TASK DETAIL
  https://phabricator.wikimedia.org/T261850

WORKBOARD
  https://phabricator.wikimedia.org/project/board/4952/
