[Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread Ashen Weerathunga
Hi all, I am currently integrating the anomaly detection feature for ML. I have a problem choosing the best accuracy measure for the model. I can get the confusion matrix, which consists of true positives, true negatives, false positives and false negatives. There are a few different meas

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread Sinnathamby Mahesan
Dear Ashen, Sensitivity: aims at reducing false negatives. Precision: aims at reducing false positives. The F1 score combines both as the harmonic mean of precision and sensitivity. That's why F1 is normally chosen, and it is simple (2TP / (2TP + FN + FP)). By the way, which you consider i
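As a rough illustration of the measures named in the message above, the following Python sketch computes sensitivity, precision and F1 directly from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the WSO2 ML code base.

```python
def confusion_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # penalised by false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # penalised by false positives
    denom = 2 * tp + fn + fp
    f1 = 2 * tp / denom if denom else 0.0               # 2TP / (2TP + FN + FP)
    return sensitivity, precision, f1


# Hypothetical counts, for illustration only.
sens, prec, f1 = confusion_metrics(tp=8, fp=4, fn=2)
print(sens, prec, f1)
```

The closed form 2TP / (2TP + FN + FP) gives the same value as the harmonic mean 2 * precision * sensitivity / (precision + sensitivity) whenever both ratios are defined. Note that true negatives do not appear in any of the three measures.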

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread CD Athuraliya
Hi Ashen, Please note the class imbalance which can typically occur in anomaly data when selecting evaluation measures (anomalous data can be very infrequent compared to normal data in a real-world dataset). Please check how this imbalance affects evaluation measures. I found this paper [1] on thi
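To see why class imbalance matters when picking a measure, consider this small Python sketch. It assumes a made-up test set of 1000 points with 10 anomalies and a degenerate model that flags nothing; the numbers and function names are illustrative assumptions only.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all points that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """F1 as 2TP / (2TP + FN + FP); 0.0 when the denominator is zero."""
    denom = 2 * tp + fn + fp
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced test set: 10 anomalies among 1000 points,
# and a model that labels every point as normal.
tp, fp, fn, tn = 0, 0, 10, 990

print(accuracy(tp, fp, fn, tn))  # 0.99
print(f1_score(tp, fp, fn))      # 0.0
```

Plain accuracy looks excellent here although not a single anomaly was found, while F1 exposes the failure.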

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread Srinath Perera
Seshika and I were talking to a Forrester analyst, and he mentioned that the "Lorenz curve" is used in fraud cases. Please read up on what it is and how it compares to ROC etc. See https://www.quora.com/What-is-the-difference-between-a-ROC-curve-and-a-precision-recall-curve-When-should-I-use-each

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread Srinath Perera
Ashen, when you conclude this, can you write a blog/article comparing the different methods and explaining why a given one is better? --Srinath On Thu, Sep 17, 2015 at 9:59 AM, Srinath Perera wrote: > Seshika and myself were talking to forester analyst and he mentioned "Lorenz > curve" is used in fraud ca

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-16 Thread madhuka udantha
Hi, This is a good survey paper on anomaly detection [1]. For your needs, it seems you will not need to go through the whole survey, but a few subtopics will be very useful for your work. [1] Varun Chandola, Arindam Banerjee,

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-18 Thread Ashen Weerathunga
Hi all. Since we are considering anomaly detection, a true positive would be a case where a true anomaly is detected as an anomaly by the model. Since, as you said, the positive (anomaly) instances are very rare in real-world anomaly detection, we can't go for a more general measure. So I c

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-23 Thread Ashen Weerathunga
Hi all, Thanks Mahesan for the suggestion. Yes, we can give all the measures if that is better. But there is a problem with drawing a PR curve or ROC curve. Since we can get only one point from the confusion matrix, we can't give a PR curve or ROC curve in the summary of the model. Currently ROC curve pr

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-24 Thread Supun Sethunga
Hi Ashen, In probabilistic models, we compare the predicted output for a new data point against a cutoff probability to decide which class it belongs to. This cutoff probability is decided by the user, who is free to vary it from 0 to 1. So for a set of newly-arrived data
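The cutoff idea described above can be sketched in Python as follows: each cutoff probability yields one confusion matrix, and therefore one (false positive rate, true positive rate) point; sweeping the cutoff from 0 to 1 traces out an ROC curve. The function name, scores and labels are hypothetical, and scoring a point as positive when its probability is at or above the cutoff is an assumption of this sketch.

```python
def roc_points(scores, labels, cutoffs):
    """Return a list of (cutoff, fpr, tpr), one entry per cutoff.

    scores: predicted probabilities of the positive class
    labels: true classes (1 = positive, 0 = negative)
    """
    points = []
    for cutoff in cutoffs:
        tp = fp = fn = tn = 0
        for score, label in zip(scores, labels):
            predicted_positive = score >= cutoff  # assumption: ties count as positive
            if predicted_positive and label == 1:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif label == 1:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((cutoff, fpr, tpr))
    return points


# Hypothetical predictions, for illustration only.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
for cutoff, fpr, tpr in roc_points(scores, labels, [0.0, 0.5, 1.0]):
    print(cutoff, fpr, tpr)
```

A single fixed cutoff gives only one of these points, which is why a single confusion matrix is not enough to draw the curve.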

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-24 Thread Supun Sethunga
> > ...test data according to the percentile value that user provided. Sorry, I missed this part. If so, instead of asking the user for the percentile, can't we create the ROC and let them decide the best percentile by looking at it? On Thu, Sep 24, 2015 at 10:26 AM, Supun Sethunga wrote: > Hi A

Re: [Dev] [ML] Accuracy Measure for Anomaly Detection?

2015-09-24 Thread Ashen Weerathunga
Thanks Dr. Ruvan and Supun for the suggestions! Yes Supun, in this scenario we consider a percentile value of all distances to identify the cluster boundary, rather than just considering the max distance. Right now we are getting that percentile value from the user. Yes, if we do calculate a set of c
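One way to read the percentile discussion above is sketched below in Python. It assumes that each point is scored by its distance to its cluster centre, that the boundary is a nearest-rank percentile of the training distances, and that a test point is flagged as an anomaly when its distance exceeds that boundary. These details, the function names and the data are assumptions for illustration, not the actual WSO2 ML implementation.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of values, for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sweep_percentiles(train_distances, test_distances, test_labels, percentiles):
    """Return (percentile, boundary, fpr, tpr) for each candidate percentile."""
    results = []
    for p in percentiles:
        boundary = percentile(train_distances, p)
        tp = fp = fn = tn = 0
        for distance, label in zip(test_distances, test_labels):
            flagged = distance > boundary  # outside the cluster boundary
            if flagged and label == 1:
                tp += 1
            elif flagged:
                fp += 1
            elif label == 1:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        results.append((p, boundary, fpr, tpr))
    return results


# Hypothetical distances and labels (1 = anomaly), for illustration only.
train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test = [2, 6, 9, 12]
labels = [0, 0, 1, 1]
for row in sweep_percentiles(train, test, labels, [100, 80, 50]):
    print(row)
```

Each percentile produces one confusion matrix and hence one ROC point, so sweeping a range of percentiles gives the user a curve to choose from rather than requiring a single value up front.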