Hi,

Can you try using the save method instead of write?

ex: out_df.save("path", "parquet")
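On 1.4+/1.5 both of these spellings should be available (untested sketch; the path and the overwrite mode are just placeholders, adjust them to your job):

    # Older DataFrame.save API (deprecated since 1.4, but still present):
    out_df.save("/tmp/logparquet", source="parquet", mode="overwrite")

    # The same write through the newer DataFrameWriter, spelled out explicitly:
    out_df.write.format("parquet").mode("overwrite").save("/tmp/logparquet")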
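One more thing: judging by the YarnAllocator lines quoted below you are running with the default memory settings, and all the OutOfMemoryErrors come from driver-side threads (sparkDriver-scheduler-1, LeaseRenewer, Reporter). If save alone doesn't help, it may be worth giving the driver more headroom explicitly; the values here are only a starting guess, not a tested fix:

    spark-submit --master yarn --deploy-mode cluster \
        --driver-memory 2g \
        --conf spark.yarn.driver.memoryOverhead=512 \
        spark-ml.py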
b0c1
----------------------------------------------------------------------
Skype: boci13, Hangout: boci.b...@gmail.com

On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth <zoltanct...@gmail.com> wrote:

> Aaand, the error! :)
>
> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
> Exception in thread "Thread-7"
> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-7"
> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception in thread "Reporter"
> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reporter"
> Exception in thread "qtp2115718813-47"
> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp2115718813-47"
>
> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>
> Log Type: stdout
> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
> Log Length: 986
>
> Traceback (most recent call last):
>   File "spark-ml.py", line 33, in <module>
>     out_df.write.parquet("/tmp/logparquet")
>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 422, in parquet
>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError
>
> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth <zoltanct...@gmail.com> wrote:
>
>> Hi,
>>
>> When I execute the Spark ML Logistic Regression example in pyspark I run
>> into an OutOfMemory exception. I'm wondering if any of you have experienced
>> the same, or have a hint about how to fix this.
>>
>> The interesting bit is that I only get the exception when I try to write
>> the result DataFrame into a file. If I only "print" any of the results, it
>> all works fine.
>>
>> My Setup:
>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
>> nightly build)
>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
>> -DzincPort=3034
>>
>> I'm using the default resource setup:
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>
>> The script I'm executing:
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SQLContext
>>
>> conf = SparkConf().setAppName("pysparktest")
>> sc = SparkContext(conf=conf)
>> sqlContext = SQLContext(sc)
>>
>> from pyspark.mllib.regression import LabeledPoint
>> from pyspark.mllib.linalg import Vector, Vectors
>>
>> training = sc.parallelize((
>>     LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>>     LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>>     LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>>     LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
>>
>> training_df = training.toDF()
>>
>> from pyspark.ml.classification import LogisticRegression
>>
>> reg = LogisticRegression()
>> reg.setMaxIter(10).setRegParam(0.01)
>> model = reg.fit(training_df)
>>
>> test = sc.parallelize((
>>     LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>>     LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>>     LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))
>>
>> out_df = model.transform(test.toDF())
>>
>> out_df.write.parquet("/tmp/logparquet")
>>
>> And the command:
>> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>>
>> Thanks,
>> z