I'm using PySpark ML's logistic regression implementation to do some
classification on an AWS EMR YARN cluster.

The cluster consists of 10 m3.xlarge nodes and is set up as follows:

spark.driver.memory 10g
spark.driver.cores 3
spark.executor.memory 10g
spark.executor.cores 4
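
For completeness, the settings are passed roughly like this at submit time (a sketch; the script name is a placeholder):

spark-submit \
  --master yarn \
  --driver-memory 10g \
  --driver-cores 3 \
  --executor-memory 10g \
  --executor-cores 4 \
  train_lr.py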

I have also enabled dynamic executor allocation on YARN.
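
Concretely, that amounts to something like the following in spark-defaults.conf (a sketch; I did not set explicit executor bounds, so the min/max values below are purely illustrative):

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
# Illustrative bounds, left unset in my actual config;
# spark.dynamicAllocation.maxExecutors is effectively unbounded by default
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 40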

The problem is that my runs are highly unstable. Sometimes the application
finishes having used 13 executors in total; other times all of them seem to
die and the application ends up using anywhere between 100 and 200.
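
For reference, one way to watch the executor count while the job runs is Spark's monitoring REST API (a sketch; <driver-host> and <app-id> are placeholders for the real values):

import requests

# Spark serves a monitoring REST API from the driver UI (port 4040 by default).
url = "http://<driver-host>:4040/api/v1/applications/<app-id>/executors"
executors = requests.get(url).json()
# The endpoint lists the currently active executors (the driver included).
print("active executors: %d" % len(executors))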

Any insight into what could cause this stochastic behaviour would be greatly
appreciated.

The code used to run the logistic regression:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Load the data and spread it across 80 partitions
data = spark.read.parquet(storage_path).repartition(80)

# Logistic regression, at most 50 iterations, regularisation strength 0.063
lr = LogisticRegression()
lr.setMaxIter(50)
lr.setRegParam(0.063)
evaluator = BinaryClassificationEvaluator()

# Fit on the training rows (test == 0), score the held-out rows (test == 1)
lrModel = lr.fit(data.filter(data.test == 0))
predictions = lrModel.transform(data.filter(data.test == 1))
auROC = evaluator.evaluate(predictions)
print("auROC on test set: %s" % auROC)
The data is a DataFrame of roughly 2.8 GB.
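
In case it helps, this is roughly how I sanity-check the input after the repartition (a sketch):

# Confirm the partition count and row count after repartition(80)
print("partitions: %d" % data.rdd.getNumPartitions())
print("rows: %d" % data.count())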


