Hi,

When I execute the Spark ML Logistic Regression example in pyspark, I run
into an OutOfMemory exception. I'm wondering if any of you have experienced
the same or have a hint about how to fix this.

The interesting bit is that I only get the exception when I try to write
the result DataFrame into a file. If I only "print" any of the results, it
all works fine.
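To be concrete, this is roughly the distinction (a sketch only; "print" here
means something like show()/collect(), and out_df is the result DataFrame
from the script below):

# Printing the results works fine, e.g.:
out_df.show()
print(out_df.collect())

# Writing the result DataFrame throws the OutOfMemory exception:
out_df.write.parquet("/tmp/logparquet")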

My Setup:
Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
nightly build)
Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
-DzincPort=3034

I'm using the default resource setup:
15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
capability: <memory:1408, vCores:1>)
15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
capability: <memory:1408, vCores:1>)

The script I'm executing:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("pysparktest")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

training = sc.parallelize((
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

training_df = training.toDF()

from pyspark.ml.classification import LogisticRegression

reg = LogisticRegression()

reg.setMaxIter(10).setRegParam(0.01)
model = reg.fit(training_df)

test = sc.parallelize((
  LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))

out_df = model.transform(test.toDF())

out_df.write.parquet("/tmp/logparquet")

And the command:
spark-submit --master yarn --deploy-mode cluster spark-ml.py
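
For comparison, the same submission with explicit memory settings would look
like this (sketch only; the 2g values are arbitrary and I haven't verified
whether this avoids the exception):

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  spark-ml.py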

Thanks,
z
