Aaand, the error! :)

Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"org.apache.hadoop.hdfs.PeerCache@4e000abf"
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2115718813-47"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "qtp2115718813-47"

Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"

Log Type: stdout

Log Upload Time: Mon Sep 07 09:03:01 -0400 2015

Log Length: 986

Traceback (most recent call last):
  File "spark-ml.py", line 33, in <module>
    out_df.write.parquet("/tmp/logparquet")
  File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 422, in parquet
  File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
  File "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_000001/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError
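
The Py4JJavaError at the end gets printed here without its Java-side
message, so the actual driver stack trace is cut off. The full trace
should be retrievable from the YARN aggregated logs, e.g.:

yarn logs -applicationId application_1441224592929_0022

Since all of the failing threads (PeerCache, LeaseRenewer,
sparkDriver-scheduler-1) live in the driver JVM, I suspect the
application master's heap is what fills up. A sketch of what I could
try, with an arbitrary 2g rather than a value I've verified:

spark-submit --master yarn --deploy-mode cluster --driver-memory 2g spark-ml.py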



On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth <zoltanct...@gmail.com> wrote:

> Hi,
>
> When I execute the Spark ML Logistic Regression example in pyspark, I run
> into an OutOfMemory exception. I'm wondering if any of you have experienced
> the same, or have a hint about how to fix it.
>
> The interesting bit is that I only get the exception when I try to write
> the result DataFrame into a file. If I only "print" any of the results, it
> all works fine.
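> (By "print" I mean e.g. out_df.show() or collecting the handful of rows
> with out_df.collect(); those finish without any error.)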
>
> My Setup:
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
> -DzincPort=3034
>
> I'm using the default resource setup:
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: <memory:1408, vCores:1>)
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: <memory:1408, vCores:1>)
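> (That 1408 MB per container is presumably the 1024 MB default
> spark.executor.memory plus the 384 MB YARN memory overhead:
> 1024 + 384 = 1408 MB.)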
>
> The script I'm executing:
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("pysparktest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
>
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vectors
>
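> # Toy training set: four labeled points with three dense features each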
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>   LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
>
> training_df = training.toDF()
>
> from pyspark.ml.classification import LogisticRegression
>
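> # Fit a spark.ml logistic regression on the toy DataFrame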
> reg = LogisticRegression()
>
> reg.setMaxIter(10).setRegParam(0.01)
> model = reg.fit(training_df)
>
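> # Small test set to score with the fitted model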
> test = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>   LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))
>
> out_df = model.transform(test.toDF())
>
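> # Writing the predictions to parquet is the step that triggers the OOM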
> out_df.write.parquet("/tmp/logparquet")
>
> And the command:
> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>
> Thanks,
> z
>
