Hey, I'd try to debug, profile ResolvedDataSource. As far as I know, your write will be performed by the JVM.
On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán <t...@looper.hu> wrote: > Unfortunately I'm getting the same error: > The other interesting things are that: > - the parquet files got actually written to HDFS (also with > .write.parquet() ) > - the application gets stuck in the RUNNING state for good even after the > error is thrown > > 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19 > 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5 > 15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20 > Exception in thread "Thread-7" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Thread-7" > Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4070d501" > Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread > "LeaseRenewer:r...@docker.rapidminer.com:8020" > Exception in thread "Reporter" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Reporter" > Exception in thread "qtp2134582502-46" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "qtp2134582502-46" > > > > > On Mon, Sep 7, 2015 at 3:48 PM, boci <boci.b...@gmail.com> wrote: > >> Hi, >> >> Can you try to using save method instead of write? >> >> ex: out_df.save("path","parquet") >> >> b0c1 >> >> >> ---------------------------------------------------------------------------------------------------------------------------------- >> Skype: boci13, Hangout: boci.b...@gmail.com >> >> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth <zoltanct...@gmail.com> >> wrote: >> >>> Aaand, the error! :) >>> >>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf" >>> Exception: java.lang.OutOfMemoryError thrown from the >>> UncaughtExceptionHandler in thread >>> "org.apache.hadoop.hdfs.PeerCache@4e000abf" >>> Exception in thread "Thread-7" >>> Exception: java.lang.OutOfMemoryError thrown from the >>> UncaughtExceptionHandler in thread "Thread-7" >>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020" >>> Exception: java.lang.OutOfMemoryError thrown from the >>> UncaughtExceptionHandler in thread >>> "LeaseRenewer:r...@docker.rapidminer.com:8020" >>> Exception in thread "Reporter" >>> Exception: java.lang.OutOfMemoryError thrown from the >>> UncaughtExceptionHandler in thread "Reporter" >>> Exception in thread "qtp2115718813-47" >>> Exception: java.lang.OutOfMemoryError thrown from the >>> UncaughtExceptionHandler in thread "qtp2115718813-47" >>> >>> Exception: java.lang.OutOfMemoryError thrown from the >>> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1" >>> >>> Log Type: stdout >>> >>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015 >>> >>> Log Length: 986 >>> >>> Traceback (most recent call last): >>> File "spark-ml.py", line 33, in <module> >>> out_df.write.parquet("/tmp/logparquet") >>> File >>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/readwriter.py", >>> line 422, in parquet >>> File >>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", >>> line 538, in __call__ >>> File >>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/utils.py", >>> line 36, in deco >>> File >>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/protocol.py", >>> line 300, in get_return_value >>> py4j.protocol.Py4JJavaError >>> >>> >>> >>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth <zoltanct...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> When I execute the Spark ML Logisitc Regression example in pyspark I >>>> run into an OutOfMemory exception. I'm wondering if any of you experienced >>>> the same or has a hint about how to fix this. >>>> >>>> The interesting bit is that I only get the exception when I try to >>>> write the result DataFrame into a file. If I only "print" any of the >>>> results, it all works fine. >>>> >>>> My Setup: >>>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the >>>> latest nightly build) >>>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn >>>> -DzincPort=3034 >>>> >>>> I'm using the default resource setup >>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor >>>> containers, each with 1 cores and 1408 MB memory including 384 MB overhead >>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: >>>> Any, capability: <memory:1408, vCores:1>) >>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: >>>> Any, capability: <memory:1408, vCores:1>) >>>> >>>> The script I'm executing: >>>> from pyspark import SparkContext, SparkConf >>>> from pyspark.sql import SQLContext >>>> >>>> conf = SparkConf().setAppName("pysparktest") >>>> sc = SparkContext(conf=conf) >>>> sqlContext = SQLContext(sc) >>>> >>>> from pyspark.mllib.regression import LabeledPoint >>>> from pyspark.mllib.linalg import Vector, Vectors >>>> >>>> training = sc.parallelize(( >>>> LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)), >>>> LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)), >>>> LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)), >>>> LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))) >>>> >>>> training_df = training.toDF() >>>> >>>> from pyspark.ml.classification import LogisticRegression >>>> >>>> reg = LogisticRegression() >>>> >>>> reg.setMaxIter(10).setRegParam(0.01) >>>> model = reg.fit(training.toDF()) >>>> >>>> test = sc.parallelize(( >>>> LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)), >>>> LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)), >>>> LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5)))) >>>> >>>> out_df = model.transform(test.toDF()) >>>> >>>> out_df.write.parquet("/tmp/logparquet") >>>> >>>> And the command: >>>> spark-submit --master yarn --deploy-mode cluster spark-ml.py >>>> >>>> Thanks, >>>> z >>>> >>> >>> >> >