Xinyong Tian created SPARK-24431:
------------------------------------

             Summary: wrong areaUnderPR calculation in 
BinaryClassificationEvaluator 
                 Key: SPARK-24431
                 URL: https://issues.apache.org/jira/browse/SPARK-24431
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Xinyong Tian


My problem: I am using CrossValidator(estimator=LogisticRegression(...), ...,  
evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select 
the best model. When regParam in logistic regression is very high, no variables 
are selected (no model), i.e. every row's prediction is the same, e.g. equal to 
the event rate (baseline frequency). But at this point, 
BinaryClassificationEvaluator reports the highest areaUnderPR. As a result, the 
best model selected is a no-model. 

The reason is the following. With no model, the precision-recall curve has 
only two points: at recall = 0, precision should be set to zero, while the 
software sets it to 1; at recall = 1, precision is the event rate. As a result, 
the areaUnderPR will be close to 0.5 (my event rate is very low), which is the 
maximum.

The solution is to set precision = 0 when recall = 0.
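
To make the arithmetic concrete, here is a minimal sketch in plain Python (not 
Spark code). It assumes the area is obtained by trapezoidal integration over 
the two curve points described above; the helper name trapezoid_area and the 
event rate of 0.01 are hypothetical values chosen for illustration.

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve.

    `points` is a list of (recall, precision) pairs sorted by recall.
    Trapezoidal integration is an assumption made for this illustration.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


event_rate = 0.01  # hypothetical low event rate

# Reported behaviour: precision is set to 1 at recall = 0.
reported = trapezoid_area([(0.0, 1.0), (1.0, event_rate)])

# Proposed behaviour: precision is set to 0 at recall = 0.
proposed = trapezoid_area([(0.0, 0.0), (1.0, event_rate)])

print(reported, proposed)
```

Under these assumptions the reported behaviour gives (1 + event_rate) / 2, 
which is just above 0.5, while the proposed change gives event_rate / 2.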


