Hi there,

We are using MLlib 1.1.1 and running Logistic Regression on a dataset of about 150M rows. The training stage usually completes smoothly without any retries, but during the prediction stage and the BinaryClassificationMetrics stage I am seeing retries with "fetch failure" errors.
The prediction part is just the following:

    val predictionAndLabel = testRDD.map { point =>
      val prediction = model.predict(point.features)
      (prediction, point.label)
    }
    ...
    val metrics = new BinaryClassificationMetrics(predictionAndLabel)

The fetch failure happens with the following stack trace:

    org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:515)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3$lzycompute(BinaryClassificationMetrics.scala:101)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3(BinaryClassificationMetrics.scala:96)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:98)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:98)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:142)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:50)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:60)
    com.manage.ml.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:14)
    ...

We are running in yarn-client mode with 32 executors, 16G of executor memory, and 12 cores in the spark-submit settings. I wonder if anyone has suggestions on how to debug this.

Thanks in advance,
Thomas
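P.S. For reference, our submit command looks roughly like the sketch below; the executor counts, memory, and cores match the settings above, but the jar and main-class names are placeholders, not our actual ones:

```shell
# Sketch of the spark-submit invocation (yarn-client mode, Spark 1.1.x style).
# The --class and jar arguments below are placeholders.
spark-submit \
  --master yarn-client \
  --num-executors 32 \
  --executor-memory 16G \
  --executor-cores 12 \
  --class com.example.ml.EvaluationJob \
  app.jar
```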