I'm using PySpark ML's logistic regression implementation to do some classification on an AWS EMR YARN cluster.
The cluster consists of 10 m3.xlarge nodes and is set up as follows: spark.driver.memory 10g, spark.driver.cores 3, spark.executor.memory 10g, spark.executor.cores 4. I have also enabled YARN's dynamic allocation.

The problem is that my runs are very unstable. Sometimes my application finishes using 13 executors in total; sometimes all of them seem to die and the application ends up using anywhere between 100 and 200. Any insight into what could cause this stochastic behaviour would be greatly appreciated.

The code used to run the logistic regression:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    data = spark.read.parquet(storage_path).repartition(80)

    lr = LogisticRegression()
    lr.setMaxIter(50)
    lr.setRegParam(0.063)
    evaluator = BinaryClassificationEvaluator()

    lrModel = lr.fit(data.filter(data.test == 0))
    predictions = lrModel.transform(data.filter(data.test == 1))
    auROC = evaluator.evaluate(predictions)
    print("auROC on test set:", auROC)

The data is a DataFrame of roughly 2.8 GB.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-s-Logistic-Regression-runs-unstable-on-Yarn-cluster-tp27520.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
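P.S. One thing I am considering trying is bounding dynamic allocation explicitly, on the assumption that the executor churn comes from Spark scaling the executor count up and down between iterations. These are the standard knobs in spark-defaults.conf; the values below are illustrative, not what I currently run:

```
# spark-defaults.conf -- illustrative values, not my current settings
spark.dynamicAllocation.enabled              true
# the external shuffle service is required for dynamic allocation on YARN
spark.shuffle.service.enabled                true
# keep the executor count within a fixed band
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         10
# how long an idle executor survives before being released
spark.dynamicAllocation.executorIdleTimeout  60s
```

If anyone knows whether pinning minExecutors/maxExecutors like this is the right way to stabilise an iterative job such as LogisticRegression.fit, that would be useful to hear too.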