Thanks Dr. Ruvan and Supun for the suggestions!

Yes Supun, in this scenario we consider a percentile value of all distances
to identify the cluster boundary rather than just the maximum distance.
Right now we get that percentile value from the user. And yes, if we
calculate a set of confusion matrices for a set of boundary values, it will
help the user identify the best option. I will work on that. Thanks for the
idea!
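
To make that concrete, here is roughly what I have in mind (an illustrative
Python/NumPy sketch only, not the actual ML implementation; the input names
and the single global boundary are my simplifying assumptions, since in the
real flow the percentile boundary would be computed per cluster from the
training distances):

import numpy as np

def confusion_matrices_for_percentiles(distances, y_true, percentiles):
    """distances: distance of each test point to its nearest cluster centre.
    y_true: 1 if the point is a true anomaly, 0 if it is normal.
    percentiles: candidate boundary percentiles, e.g. [80, 90, 95, 99]."""
    results = {}
    for p in percentiles:
        boundary = np.percentile(distances, p)       # boundary for this candidate
        y_pred = (distances > boundary).astype(int)  # outside the boundary -> anomaly
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        tn = int(np.sum((y_pred == 0) & (y_true == 0)))
        results[p] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return results

The user could then compare the resulting matrices, or a curve derived from
them, and pick the boundary percentile that best suits the use case.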

On Thu, Sep 24, 2015 at 8:05 PM, Supun Sethunga <sup...@wso2.com> wrote:

> ...test data according to the percentile value that user provided.
>
>
> Sorry I missed this part. If so, instead of asking the user for the
> percentile, can't we create the ROC and let him decide the best percentile
> by looking at it?
>
> On Thu, Sep 24, 2015 at 10:26 AM, Supun Sethunga <sup...@wso2.com> wrote:
>
>> Hi Ashen,
>>
>> In probabilistic models, what we do is compare the predicted output of a
>> new data point against a cutoff probability to decide which class it
>> belongs to. This cutoff probability is decided by the user, and hence is
>> free to vary from 0 to 1. So for a set of newly arrived data points, we can
>> change the "cutoff probability" any number of times (between 0 and 1) and
>> obtain a series of confusion matrices.
>>
>> But in this case (from what I understood from the other mail thread, the
>> logic applied here is): you first cluster the data, then for each incoming
>> data point you find the nearest cluster and compare the distance between
>> the new point and that cluster's centre with the cluster boundary (please
>> correct me if I'm mistaken). So we have only one static value as the class
>> boundary, and hence cannot have a series of confusion matrices (which means
>> no ROC). But again, in the other mail thread you mentioned "*select the
>> percentile value from distances of each cluster as their cluster
>> boundaries*". I'm not really sure what that "percentile" value is, but if
>> it is a volatile or user-preferred value, I think we can vary it and do
>> something similar to the probabilistic case: change the cluster boundaries
>> and see how the accuracy (or the other measurement statistics) changes.
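
For reference, the decision rule described above looks roughly like this (a
minimal Python/NumPy sketch for illustration only; the inputs, the centroids
and the precomputed per-cluster boundaries, are assumptions, and this is not
the actual ML code):

import numpy as np

def is_anomaly(point, centroids, boundaries):
    # centroids: (k, d) array of cluster centres from K-means
    # boundaries: length-k array, e.g. the chosen percentile of each
    # cluster's training-point distances (assumed to be precomputed)
    dists = np.linalg.norm(centroids - point, axis=1)  # distance to every centre
    nearest = int(np.argmin(dists))                    # index of the nearest cluster
    return dists[nearest] > boundaries[nearest]        # outside its boundary -> anomaly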
>>
>> Regards,
>> Supun
>>
>>
>> On Wed, Sep 23, 2015 at 9:17 AM, Ashen Weerathunga <as...@wso2.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks Mahesan for the suggestion. Yes, we can give all the measures if
>>> that is better.
>>>
>>> But there is a problem with drawing a PR curve or ROC curve. Since we can
>>> get only one point from the confusion matrix, we can't include a PR curve
>>> or ROC curve in the model summary. Currently the ROC curve is provided
>>> only for probabilistic classification methods, and it is calculated using
>>> the model itself. But in this scenario we use the K-means algorithm: after
>>> generating the clusters we evaluate the model on the test data according
>>> to the percentile value that the user provided. As a result we get a
>>> single confusion matrix consisting of TP, TN, FP and FN, which is not
>>> enough to draw a PR curve or ROC curve. Does anyone have any suggestions
>>> about that, or should we drop it?
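
One possible way around this (an idea only, not something already
implemented): keep the raw distance scores of the test points instead of
only the final labels; the threshold can then be swept afterwards, e.g. with
scikit-learn, to obtain ROC and PR curves:

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Illustrative values: scores = distance of each test point to its nearest
# centroid (higher = more anomalous), y_true = 1 for a true anomaly.
scores = np.array([0.2, 1.5, 0.4, 3.1, 0.3, 2.7])
y_true = np.array([0, 1, 0, 1, 0, 1])

fpr, tpr, _ = roc_curve(y_true, scores)
precision, recall, _ = precision_recall_curve(y_true, scores)
print("ROC AUC:", auc(fpr, tpr))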
>>>
>>> On Mon, Sep 21, 2015 at 7:05 AM, Sinnathamby Mahesan <
>>> sinnatha...@wso2.com> wrote:
>>>
>>>> Ashen
>>>> Here is a situation:
>>>> Doctors are testing a person for a disease, say, d.
>>>> From the doctor's point of view, +ve means the patient has (d).
>>>>
>>>> Which of the following is worse than the other?
>>>> (1) The person who does NOT have (d) is identified as having (d) -
>>>>  (that is, a false positive)
>>>> (2) The person who does have (d) is identified as NOT having (d) -
>>>>  (that is, a false negative)
>>>>
>>>> The doctors' argument is that we have to be more concerned about
>>>> reducing case (2); that is to say, the sensitivity needs to be high.
>>>>
>>>> Anyway, I also thought it is better to display all the measures:
>>>> sensitivity, specificity, precision and F1-score
>>>> (suggesting that sensitivity be considered, with anomalous taken as the
>>>> positive class).
>>>>
>>>> Good Luck
>>>> Mahesan
>>>>
>>>>
>>>> On 18 September 2015 at 15:27, Ashen Weerathunga <as...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Since we are considering anomaly detection, a true positive would be a
>>>>> case where a true anomaly is detected as an anomaly by the model. Since
>>>>> in real-world anomaly detection scenarios, as you said, the positive
>>>>> (anomaly) instances are very rare, we can't go for the more general
>>>>> measures. So I can summarize the most applicable measures as below:
>>>>>
>>>>>    - Sensitivity (recall) - gives the true positive rate: TP / (TP + FN)
>>>>>    - Precision - gives the probability that a positive prediction is a
>>>>>    true positive: TP / (TP + FP)
>>>>>    - PR curve - the precision-recall (sensitivity) curve, which plots
>>>>>    precision vs. recall
>>>>>    - F1 score - the harmonic mean of precision and sensitivity (recall):
>>>>>    2TP / (2TP + FP + FN)
>>>>>
>>>>> So precision and sensitivity are the most suitable measures for a model
>>>>> where positive instances are very scarce, and the PR curve and F1 score
>>>>> combine both of them. So the PR curve and F1 score can be used to tell
>>>>> how good the model is, IMO. We can also give sensitivity and precision
>>>>> separately.
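
A small worked example of the formulas above, with purely illustrative
counts:

tp, fp, fn, tn = 40, 10, 5, 945

sensitivity = tp / (tp + fn)               # recall: 40 / 45 ~ 0.889
precision   = tp / (tp + fp)               # 40 / 50 = 0.800
f1          = 2 * tp / (2 * tp + fp + fn)  # 80 / 95 ~ 0.842

print(sensitivity, precision, f1)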
>>>>>
>>>>> Thanks everyone for the support.
>>>>>
>>>>> @Srinath, sure, I will write an article.
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Ashen
>>>>>
>>>>> On Thu, Sep 17, 2015 at 10:19 AM, madhuka udantha <
>>>>> madhukaudan...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is a good survey paper on anomaly detection [1]. Given your
>>>>>> needs, it seems you will not have to go through the whole survey; a
>>>>>> few of its subtopics will be very useful for your work.
>>>>>>
>>>>>> [1] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly
>>>>>> detection: A survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58
>>>>>> pages. DOI=10.1145/1541880.1541882
>>>>>> <http://www.researchgate.net/profile/Vipin_Kumar26/publication/220565847_Anomaly_detection_A_survey/links/0deec5161f0ca7302a000000.pdf>
>>>>>> [Cited by 2458]
>>>>>>
>>>>>> On Wed, Sep 16, 2015 at 3:35 PM, Ashen Weerathunga <as...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am currently integrating the anomaly detection feature into ML. I
>>>>>>> have a problem choosing the best accuracy measure for the model. I
>>>>>>> can get the confusion matrix, which consists of true positives, true
>>>>>>> negatives, false positives and false negatives. There are a few
>>>>>>> different measures such as sensitivity, accuracy, F1 score, etc. So
>>>>>>> what would be the best measure to report as the accuracy of an
>>>>>>> anomaly detection model?
>>>>>>>
>>>>>>> Some details about those measures are given below [1].
>>>>>>> [1] https://en.wikipedia.org/wiki/Sensitivity_and_specificity
>>>>>>>
>>>>>>> Terminology and derivations from a confusion matrix
>>>>>>> <https://en.wikipedia.org/wiki/Confusion_matrix>:
>>>>>>>
>>>>>>> true positive (TP) - eqv. with hit
>>>>>>> true negative (TN) - eqv. with correct rejection
>>>>>>> false positive (FP) - eqv. with false alarm, Type I error
>>>>>>> false negative (FN) - eqv. with miss, Type II error
>>>>>>>
>>>>>>> sensitivity or true positive rate (TPR), eqv. with hit rate, recall:
>>>>>>> TPR = TP / P = TP / (TP + FN)
>>>>>>> specificity (SPC) or true negative rate:
>>>>>>> SPC = TN / N = TN / (TN + FP)
>>>>>>> precision or positive predictive value (PPV):
>>>>>>> PPV = TP / (TP + FP)
>>>>>>> negative predictive value (NPV):
>>>>>>> NPV = TN / (TN + FN)
>>>>>>> fall-out or false positive rate (FPR):
>>>>>>> FPR = FP / N = FP / (FP + TN) = 1 - SPC
>>>>>>> false negative rate (FNR):
>>>>>>> FNR = FN / (TP + FN) = 1 - TPR
>>>>>>> false discovery rate (FDR):
>>>>>>> FDR = FP / (TP + FP) = 1 - PPV
>>>>>>>
>>>>>>> accuracy (ACC):
>>>>>>> ACC = (TP + TN) / (TP + FP + FN + TN)
>>>>>>> F1 score, the harmonic mean of precision and sensitivity:
>>>>>>> F1 = 2TP / (2TP + FP + FN)
>>>>>>> Matthews correlation coefficient (MCC):
>>>>>>> MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
>>>>>>> Informedness = TPR + SPC - 1
>>>>>>> Markedness = PPV + NPV - 1
>>>>>>>
>>>>>>> Sources: Fawcett (2006) and Powers (2011).
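
For quick reference, the other derived measures in the list above can be
computed directly from the four counts; a plain Python sketch, illustrative
only:

import math

def other_measures(tp, fp, fn, tn):
    return {
        "specificity (SPC)": tn / (tn + fp),
        "NPV":               tn / (tn + fn),
        "fall-out (FPR)":    fp / (fp + tn),
        "accuracy (ACC)":    (tp + tn) / (tp + fp + fn + tn),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }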
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Ashen
>>>>>>> --
>>>>>>> *Ashen Weerathunga*
>>>>>>> Software Engineer - Intern
>>>>>>> WSO2 Inc.: http://wso2.com
>>>>>>> lean.enterprise.middleware
>>>>>>>
>>>>>>> Email: as...@wso2.com
>>>>>>> Mobile: +94 716042995 <94716042995>
>>>>>>> LinkedIn:
>>>>>>> *http://lk.linkedin.com/in/ashenweerathunga
>>>>>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Dev mailing list
>>>>>>> Dev@wso2.org
>>>>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> Madhuka Udantha
>>>>>> http://madhukaudantha.blogspot.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Ashen Weerathunga*
>>>>> Software Engineer - Intern
>>>>> WSO2 Inc.: http://wso2.com
>>>>> lean.enterprise.middleware
>>>>>
>>>>> Email: as...@wso2.com
>>>>> Mobile: +94 716042995 <94716042995>
>>>>> LinkedIn:
>>>>> *http://lk.linkedin.com/in/ashenweerathunga
>>>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> Sinnathamby Mahesan
>>>>
>>>>
>>>>
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>
>>>
>>>
>>> --
>>> *Ashen Weerathunga*
>>> Software Engineer - Intern
>>> WSO2 Inc.: http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> Email: as...@wso2.com
>>> Mobile: +94 716042995 <94716042995>
>>> LinkedIn:
>>> *http://lk.linkedin.com/in/ashenweerathunga
>>> <http://lk.linkedin.com/in/ashenweerathunga>*
>>>
>>> _______________________________________________
>>> Dev mailing list
>>> Dev@wso2.org
>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>
>>>
>>
>>
>> --
>> *Supun Sethunga*
>> Software Engineer
>> WSO2, Inc.
>> http://wso2.com/
>> lean | enterprise | middleware
>> Mobile : +94 716546324
>>
>
>
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324
>



-- 
*Ashen Weerathunga*
Software Engineer - Intern
WSO2 Inc.: http://wso2.com
lean.enterprise.middleware

Email: as...@wso2.com
Mobile: +94 716042995 <94716042995>
LinkedIn:
*http://lk.linkedin.com/in/ashenweerathunga
<http://lk.linkedin.com/in/ashenweerathunga>*
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
