Hi there,

We are using MLlib 1.1.1 and doing logistic regression on a dataset of
about 150M rows.
The training part usually goes smoothly without any retries, but during
the prediction stage and the BinaryClassificationMetrics stage I am
seeing retries with "fetch failure" errors.

The prediction code is as follows:

        val predictionAndLabel = testRDD.map { point =>
            val prediction = model.predict(point.features)
            (prediction, point.label)
        }
...
        val metrics = new BinaryClassificationMetrics(predictionAndLabel)
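The stack trace below ends in areaUnderROC, which is computed from the (prediction, label) pairs. For reference, here is a minimal, Spark-free sketch of the quantity that call produces, written in the rank-sum (Mann-Whitney) form; the `AucSketch` object and the no-tied-scores assumption are mine for illustration, not MLlib's actual implementation:

```scala
// Spark-free sketch of the AUC that areaUnderROC computes from
// (score, label) pairs, via the rank-sum (Mann-Whitney) statistic.
// Assumes no tied scores, and labels of 0.0 / 1.0.
object AucSketch {
  def areaUnderROC(scoreAndLabel: Seq[(Double, Double)]): Double = {
    // Rank all examples by ascending score (ranks are 1-based below).
    val ranked = scoreAndLabel.sortBy(_._1).zipWithIndex
    val nPos = scoreAndLabel.count(_._2 == 1.0).toDouble
    val nNeg = scoreAndLabel.size - nPos
    // Sum of 1-based ranks of the positive examples.
    val posRankSum = ranked.collect {
      case ((_, label), i) if label == 1.0 => (i + 1).toDouble
    }.sum
    (posRankSum - nPos * (nPos + 1) / 2) / (nPos * nNeg)
  }

  def main(args: Array[String]): Unit = {
    // Perfectly separated scores give AUC = 1.0.
    val pairs = Seq((0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0))
    println(areaUnderROC(pairs))  // 1.0
  }
}
```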

The fetch failure happened with the following stack trace:

org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:515)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3$lzycompute(BinaryClassificationMetrics.scala:101)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3(BinaryClassificationMetrics.scala:96)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:98)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:98)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:142)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:50)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:60)
com.manage.ml.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:14)

...

We are running in yarn-client mode, with 32 executors, 16G executor
memory, and 12 cores in the spark-submit settings.
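For concreteness, those settings correspond to a submit command along these lines (the class and jar names are placeholders, not our actual ones):

```shell
# Placeholder class/jar names; executor settings match the ones above.
spark-submit \
  --master yarn-client \
  --num-executors 32 \
  --executor-memory 16G \
  --executor-cores 12 \
  --class com.example.TrainAndEvaluate \
  app.jar
```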

I wonder if anyone has suggestions on how to debug this.

Thanks in advance,
Thomas
