[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499109#comment-16499109
 ] 

Teng Peng edited comment on SPARK-24431 at 6/2/18 6:48 PM:
-----------------------------------------------------------

I am trying to understand this description. What is your definition of event 
rate? Is it defined as (TP + FN) / (TP + FP + TN + FN)?
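
To make the candidate definition concrete, here is a minimal sketch in plain Python. The function name and the counts are hypothetical and are not taken from Spark or from the issue; it only shows the arithmetic of the formula above.

```python
def event_rate(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all examples whose true label is positive."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # TP + FN is the number of truly positive examples.
    return (tp + fn) / total


# Hypothetical counts: 5 true positives-or-misses among 100 examples.
rate = event_rate(tp=3, fp=10, tn=85, fn=2)
```

Under this definition the event rate depends only on the labels, not on the model's predictions.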

Can you take a look at the test "binary evaluation metrics for RDD where all 
examples have negative label"? Is this an extreme case that is close to what you 
have?

Also, 0.5 is not the maximum of areaUnderPR, which could attain 1.0. 
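
As a rough illustration of that last point, the sketch below builds a precision-recall curve in plain Python and integrates it with the trapezoidal rule. It is not Spark's implementation: the helper names are hypothetical, the data is made up, scores are assumed to have no ties, and the curve is assumed to start at (recall 0, precision 1) as the issue describes.

```python
from typing import List, Sequence, Tuple


def pr_curve(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) points, one per example in descending score order."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 1.0)]  # assumed starting point, as described in the issue
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / positives, tp / (tp + fp)))
    return points


def area_under_pr(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a list of (recall, precision) points."""
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(points, points[1:]))


# Made-up data: every positive is scored above every negative.
perfect_area = area_under_pr(pr_curve([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

With a perfect ranking, precision stays at 1 until recall reaches 1, so the area comes out as 1.0 rather than 0.5.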


> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
> evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, no 
> variable is selected (no model), i.e. every row's prediction is the same, e.g. 
> equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator reports the highest areaUnderPR. As a result, the 
> best model selected is a no-model.
> The reason is as follows. With no model, the precision-recall curve has 
> only two points: at recall = 0, precision should be set to zero, while the 
> software sets it to 1; at recall = 1, precision is the event rate. As a result, 
> the areaUnderPR will be close to 0.5 (my event rate is very low), which is 
> the maximum.
> The solution is to set precision = 0 when recall = 0.
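
The arithmetic in the description can be checked with a small plain-Python sketch. It assumes the two-point curve exactly as the reporter describes it and a hypothetical event rate of 0.02; the function name is illustrative and none of this calls Spark.

```python
from typing import Sequence, Tuple


def trapezoid_area(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a list of (recall, precision) points."""
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(points, points[1:]))


event_rate = 0.02  # hypothetical, "very low" as in the report

# Curve as reported: precision 1 at recall 0, event rate at recall 1.
reported_area = trapezoid_area([(0.0, 1.0), (1.0, event_rate)])

# Curve under the reporter's proposal: precision 0 at recall 0.
proposed_area = trapezoid_area([(0.0, 0.0), (1.0, event_rate)])
```

For a two-point curve the area is (first precision + event rate) / 2, which is roughly 0.51 for the reported curve and 0.01 under the proposal. That matches the "close to 0.5" figure in the description.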


