[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502761#comment-16502761 ]
Xinyong Tian commented on SPARK-24431:
--------------------------------------

Your understanding of event rate is what I meant. I understand that the maximum areaUnderPR can be 1. What I meant is that 0.5 is the maximum areaUnderPR within the grid I searched.

For example, say there is a dataset with event rate 0.01 and the best model's areaUnderPR is 0.30. Without any model, we can set the predicted probability for each row to 0.01. This is what happens when there is too much regularization. The problem is that, in this situation, BinaryClassificationEvaluator calculates areaUnderPR as 0.50 (for the reason, see the original description), which is better than the best model. This is not what we want.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ...,
> evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to
> select the best model. When the regParam in logistic regression is very high, no
> variable is selected (no model), i.e. every row's prediction is the same, e.g.
> equal to the event rate (baseline frequency). But at this point,
> BinaryClassificationEvaluator gives the highest areaUnderPR. As a result, the
> best model selected is a no-model.
> The reason is the following. With no model, the precision-recall curve has
> only two points: at recall = 0, precision should be set to zero, while the
> software sets it to 1; at recall = 1, precision is the event rate. As a result,
> the areaUnderPR will be close to 0.5 (my event rate is very low), which is
> the maximum.
> The solution is to set precision = 0 when recall = 0.
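
The two-point arithmetic in the description can be checked with a small sketch. This is plain Python, not Spark code: the helper name `trapezoid_area` is hypothetical, integrating the curve with the trapezoidal rule is an assumption, and the event rate of 0.01 is taken from the example in the comment.

```python
def trapezoid_area(points):
    """Trapezoidal area under a curve given as (recall, precision) pairs sorted by recall."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


event_rate = 0.01  # from the comment's example; every row gets the same score

# With a constant score, the only real curve point is (recall=1, precision=event_rate).
# The description reports that the point at recall=0 is given precision 1.
area_with_one = trapezoid_area([(0.0, 1.0), (1.0, event_rate)])

# The fix proposed in the description: precision 0 at recall 0.
area_with_zero = trapezoid_area([(0.0, 0.0), (1.0, event_rate)])

print(area_with_one, area_with_zero)
```

Under these assumptions the first area is (1 + 0.01) / 2, about 0.505, which beats the 0.30 of the best real model in the example, while the second is 0.01 / 2, about 0.005.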