Unfortunately I'm getting the same error. The other interesting things are that:

- the parquet files actually got written to HDFS (also with .write.parquet())
- the application gets stuck in the RUNNING state for good, even after the error is thrown

The error:
15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2134582502-46"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp2134582502-46"

On Mon, Sep 7, 2015 at 3:48 PM, boci <boci.b...@gmail.com> wrote:

> Hi,
>
> Can you try using the save method instead of write?
>
> ex: out_df.save("path", "parquet")
>
> b0c1
>
> ----------------------------------------------------------------------------------------------------------------------------------
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth <zoltanct...@gmail.com> wrote:
>
>> Aaand, the error! :)
>>
>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception in thread "Thread-7"
>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-7"
>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception in thread "Reporter"
>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reporter"
>> Exception in thread "qtp2115718813-47"
>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp2115718813-47"
>>
>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>>
>> Log Type: stdout
>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>> Log Length: 986
>>
>> Traceback (most recent call last):
>>   File "spark-ml.py", line 33, in <module>
>>     out_df.write.parquet("/tmp/logparquet")
>>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 422, in parquet
>>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
>>   File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
>> py4j.protocol.Py4JJavaError
>>
>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth <zoltanct...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> When I execute the Spark ML Logistic Regression example in pyspark, I run into an OutOfMemory exception. I'm wondering if any of you have experienced the same, or have a hint about how to fix this.
>>>
>>> The interesting bit is that I only get the exception when I try to write the result DataFrame into a file. If I only "print" any of the results, it all works fine.
>>>
>>> My setup:
>>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest nightly build)
>>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DzincPort=3034
>>>
>>> I'm using the default resource setup:
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>
>>> The script I'm executing:
>>>
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import SQLContext
>>>
>>> conf = SparkConf().setAppName("pysparktest")
>>> sc = SparkContext(conf=conf)
>>> sqlContext = SQLContext(sc)
>>>
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.linalg import Vector, Vectors
>>>
>>> training = sc.parallelize((
>>>     LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>>>     LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>>>     LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>>>     LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
>>>
>>> training_df = training.toDF()
>>>
>>> from pyspark.ml.classification import LogisticRegression
>>>
>>> reg = LogisticRegression()
>>> reg.setMaxIter(10).setRegParam(0.01)
>>> model = reg.fit(training_df)
>>>
>>> test = sc.parallelize((
>>>     LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>>>     LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>>>     LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))
>>>
>>> out_df = model.transform(test.toDF())
>>>
>>> out_df.write.parquet("/tmp/logparquet")
>>>
>>> And the command:
>>> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>>>
>>> Thanks,
>>> z
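
For anyone comparing the two calls boci suggested: as far as I can tell, in the 1.4/1.5 PySpark API, DataFrame.save is the deprecated pre-1.4 entry point and ends up on the same DataFrameWriter code path as df.write, so I wouldn't expect it to avoid the OOM by itself. A minimal sketch, assuming the out_df and the /tmp/logparquet path from the script above:

# Deprecated pre-1.4 API, still available in 1.5:
out_df.save("/tmp/logparquet", "parquet")

# DataFrameWriter API introduced in Spark 1.4 -- equivalent calls:
out_df.write.parquet("/tmp/logparquet")
out_df.write.format("parquet").save("/tmp/logparquet")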
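
Since all the threads that die (PeerCache, LeaseRenewer, Reporter, the qtp Jetty threads) live in the driver/application-master JVM when running with --deploy-mode cluster, and the allocation above is only 1408 MB per container, one hedged thing to try is a larger driver heap. A sketch only; the 2g / 512 values are illustrative guesses, not tested settings:

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 2g \
  --conf spark.yarn.driver.memoryOverhead=512 \
  spark-ml.py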
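
And since the application hangs in RUNNING after the error, it may need to be killed by hand before YARN releases the containers, and yarn logs can only aggregate the full container logs once the application has finished. The application id here is the one visible in the traceback paths above:

yarn application -kill application_1441224592929_0022
yarn logs -applicationId application_1441224592929_0022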