Java heap space OutOfMemoryError in pyspark spark-submit (spark version:2.2)

2018-01-04 Thread Anu B Nair
Hi, I have a data set of size 10 GB (example: Test.txt). I wrote my pyspark script like below (Test.py):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext

    spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
    sc = spark.sparkContext
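(For reference, the usual first knob for a "Java heap space" error is the memory given to the driver and executors at submit time; the values below are illustrative, not a known fix for this particular job:)

    spark-submit \
      --driver-memory 4g \
      --executor-memory 4g \
      --conf spark.driver.maxResultSize=2g \
      Test.py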

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread Jacek Laskowski
Hi,

> If the data is very large then a collect may result in OOM.

That's the general case in any part of Spark, incl. Spark Structured Streaming. Why would you collect in addBatch? It runs on the driver side, and like anything on the driver it's a single JVM (and usually not fault tolerant).

> Do
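A minimal sketch of the alternative, assuming a per-batch DataFrame batch_df and a hypothetical per-row handler send_downstream:

    # Keep the work on the executors instead of pulling the whole
    # batch into the driver JVM with collect().
    def handle_partition(rows):
        for row in rows:             # runs on an executor, partition by partition
            send_downstream(row)     # hypothetical downstream call

    batch_df.foreachPartition(handle_partition)   # distributed across executors
    # batch_df.collect()  # anti-pattern: materializes everything on the driver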

Re: Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread Shixiong(Ryan) Zhu
The root cause is probably that HDFSMetadataLog ignores exceptions thrown by "output.close". I think this should be fixed by this line in Spark 2.2.1 and 3.0.0: https://github.com/apache/spark/commit/6edfff055caea81dc3a98a6b4081313a0c0b0729#diff-aaeb546880508bb771df502318c40a99L126 Could you try
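For intuition, a Python sketch (not the actual Scala HDFSMetadataLog code) of why swallowing close() failures can commit a half-written file, which is exactly the corrupted-checkpoint symptom:

    import os

    def write_metadata(path, payload):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:    # 'with' propagates close() errors;
            f.write(payload)         # swallowing them could commit a short file
        os.replace(tmp, path)        # rename only after a clean close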

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
Thanks Tathagata for your answer. The reason I was asking about controlling data size is that the javadoc indicates you can use foreach or collect on the dataframe. If the data is very large then a collect may result in OOM. From your answer it appears that the only way to control the size (in
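(For the Kafka source specifically, one way to bound each micro-batch is the maxOffsetsPerTrigger option; a hedged sketch with hypothetical broker and topic names:)

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "input-topic")
          .option("maxOffsetsPerTrigger", 100000)  # cap offsets consumed per batch
          .load())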

Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-04 Thread Marcelo Vanzin
On Wed, Jan 3, 2018 at 8:18 PM, John Zhuge wrote:
> Something like:
>
> Note: When running Spark on YARN, environment variables for the executors
> need to be set using the spark.yarn.executorEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file or
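For example, in conf/spark-defaults.conf (values illustrative; spark.yarn.appMasterEnv.* is the documented counterpart for the Application Master):

    spark.yarn.executorEnv.JAVA_HOME    /usr/lib/jvm/java-8
    spark.yarn.appMasterEnv.JAVA_HOME   /usr/lib/jvm/java-8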

Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread William Briggs
I am running a Structured Streaming job (Spark 2.2.0) using EMR 5.9. The job sources data from a Kafka topic, performs a variety of filters and transformations, and sinks data back into a different Kafka topic. Once per day, we stop the query in order to merge the namenode edit logs with the
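A minimal PySpark sketch of the shape of such a job (broker, topics, and checkpoint path are hypothetical; a simple null check stands in for the real filters and transformations):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaToKafka").getOrCreate()

    src = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "input-topic")
           .load())

    # Stand-in for the job's filters and transformations.
    out = (src.selectExpr("CAST(value AS STRING) AS value")
           .filter("value IS NOT NULL"))

    query = (out.writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("topic", "output-topic")
             .option("checkpointLocation", "hdfs:///checkpoints/kafka-to-kafka")
             .start())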