[ https://issues.apache.org/jira/browse/SPARK-30210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Anzel updated SPARK-30210:
-------------------------------
Description:

Hi all,

While doing some machine learning work with PySpark, I ran into a confusing error message:

{{# Model and train/test set generated...}}
{{evaluator = BinaryClassificationEvaluator(labelCol=label, metricName='areaUnderROC')}}
{{prediction = model.transform(test_data)}}
{{auc = evaluator.evaluate(prediction)}}

{{org.apache.spark.SparkException: Job aborted due to stage failure: Task 37 in stage 21.0 failed 4 times, most recent failure: Lost task 37.3 in stage 21.0 (TID 2811, 10.139.65.48, executor 16): java.lang.ArrayIndexOutOfBoundsException}}

After some investigation, I found that the data I was trying to predict on had only one label represented, rather than both positive and negative labels. That was easy enough to fix, but I would like to ask if we could replace this error with one that explicitly points out the issue.

Would it be acceptable to add an ahead-of-time check that ensures both labels are represented? Alternatively, could we change the docs for BinaryClassificationEvaluator to explain what this error means?

> Give more informative error for BinaryClassificationEvaluator when data with
> only one label is provided
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30210
>                 URL: https://issues.apache.org/jira/browse/SPARK-30210
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.4.5
>        Environment: Pyspark on Databricks
>            Reporter: Paul Anzel
>            Priority: Minor
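The suggested pre-check could look roughly like the following plain-Python sketch. This is illustrative only, not actual Spark code: a real fix would inspect the distinct values of the evaluator's labelCol inside a Spark job, and check_binary_labels is a hypothetical helper name.

```python
def check_binary_labels(labels):
    """Raise a descriptive error if fewer than two distinct labels are present.

    `labels` is any iterable of label values (e.g. the distinct values
    collected from the label column before evaluation). Illustrative only.
    """
    distinct = set(labels)
    if len(distinct) < 2:
        raise ValueError(
            "BinaryClassificationEvaluator requires both positive and "
            "negative labels, but the data contains only: "
            f"{sorted(distinct)}"
        )


# With only one label present, the check fails with a clear message
# instead of an ArrayIndexOutOfBoundsException deep inside a Spark stage:
try:
    check_binary_labels([1.0, 1.0, 1.0])
except ValueError as e:
    print(e)

check_binary_labels([0.0, 1.0, 1.0])  # both labels present: no error
```

Collecting the distinct labels is a single small aggregation, so a check like this would add little overhead relative to the evaluation itself.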
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org