retry in combineByKey at BinaryClassificationMetrics.scala
Hi there,

We are using MLlib 1.1.1 and running logistic regression on a dataset of about 150M rows. The training part usually goes smoothly without any retries, but during the prediction stage and the BinaryClassificationMetrics stage I am seeing retries with a "fetch failure" error.

The prediction part is just as follows:

    val predictionAndLabel = testRDD.map { point =>
      val prediction = model.predict(point.features)
      (prediction, point.label)
    }
    ...
    val metrics = new BinaryClassificationMetrics(predictionAndLabel)

The fetch failure happened with the following stack trace:

    org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:515)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3$lzycompute(BinaryClassificationMetrics.scala:101)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3(BinaryClassificationMetrics.scala:96)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:98)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:98)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:142)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:50)
    org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:60)
    com.manage.ml.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:14)
    ...

We are running this in yarn-client mode, with 32 executors, 16G executor memory, and 12 cores as the spark-submit settings.

I wonder if anyone has suggestions on how to debug this.

Thanks in advance,
thomas
weights not changed with different reg param
Hi there,

We are on MLlib 1.1.1 and trying different regularization parameters. We noticed that regParam doesn't affect the weights at all. Is setting the reg param via the optimizer the right thing to do? Do we need to set our own updater? Is anyone else seeing the same behaviour?

Thanks again,
thomas
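
For reference, here is a minimal sketch of what "setting our own updater" could look like. It assumes that LogisticRegressionWithSGD in the 1.1.x line defaults to SimpleUpdater, which applies no regularization, so regParam only takes effect once a regularizing updater such as SquaredL2Updater is set on the optimizer. The object name RegParamSketch, the helper trainWithL2, and the iteration count are hypothetical and not taken from the original post.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object RegParamSketch {
  def trainWithL2(training: RDD[LabeledPoint], regParam: Double): LogisticRegressionModel = {
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setNumIterations(100)            // illustrative value
      .setRegParam(regParam)
      .setUpdater(new SquaredL2Updater) // assumption: the default updater ignores regParam
    lr.run(training)
  }
}
```

Training twice with different regParam values and comparing model.weights should show whether the updater was the missing piece.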
StackOverflowError on RDD.union
I am trying to combine multiple RDDs into 1 RDD, and I am using the union function. I wonder if anyone has seen StackOverflowError as follows:

    Exception in thread "main" java.lang.StackOverflowError
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
        at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
        at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
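
The repeating UnionRDD.getPartitions frames suggest a deeply nested chain of UnionRDDs, which is what repeated pairwise calls such as a.union(b).union(c)... in a loop would produce. That is an assumption, since the post does not show how the RDDs are combined. If that is the case, a minimal sketch of the alternative is to pass all of the RDDs to SparkContext.union in a single call, which builds one UnionRDD over every input instead of one nested level per input. The object name FlatUnion and the helper unionAll are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper object, for illustration only.
object FlatUnion {
  // One UnionRDD with all inputs as direct parents, so partition lookup
  // does not recurse once per input RDD.
  def unionAll[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] =
    sc.union(rdds)
}
```

This only flattens the final step: if the inputs are themselves long chains of unions, their own lineage depth is unchanged.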