Hi,
I have a data set of size 10 GB (for example, Test.txt).
I wrote my PySpark script like below (Test.py):
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
sc = spark.sparkContext
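If it helps to see the overall shape, here is a minimal sketch of how such a script might continue; the input path, the filter predicate, and the output path below are only assumptions, not the original code:

# Read the large text file as a DataFrame with a single "value" column.
lines = spark.read.text("hdfs:///data/Test.txt")  # path is an assumption

# Keep only the lines of interest (this predicate is just a placeholder).
products = lines.filter(lines.value.contains("PRODUCT"))

# Write the filtered result back out instead of collecting it to the driver.
products.write.mode("overwrite").text("hdfs:///data/Test_filtered")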
Hi,
> If the data is very large then a collect may result in OOM.
That's the general case in any part of Spark, including Spark Structured
Streaming. Why would you collect in addBatch? It runs on the driver side, and
like anything on the driver, it's a single JVM (and usually not fault
tolerant).
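If the goal is to act on every row of a batch without pulling it into the driver, foreachPartition keeps the work on the executors; a minimal PySpark sketch (the per-row handling is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ForeachInsteadOfCollect").getOrCreate()
df = spark.range(1000000)  # stand-in for a large batch DataFrame

# df.collect() would ship every row to the single driver JVM and can OOM.
# foreachPartition runs on the executors, one partition at a time.
def handle_partition(rows):
    for row in rows:
        pass  # write each row to an external system here (placeholder)

df.foreachPartition(handle_partition)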
> Do
The root cause is probably that HDFSMetadataLog ignores exceptions thrown
by "output.close". I think this should be fixed by this line in Spark 2.2.1
and 3.0.0:
https://github.com/apache/spark/commit/6edfff055caea81dc3a98a6b4081313a0c0b0729#diff-aaeb546880508bb771df502318c40a99L126
Could you try
Thanks Tathagata for your answer.
The reason I was asking about controlling data size is that the javadoc
indicates you can use foreach or collect on the dataframe. If the data is very
large then a collect may result in OOM.
From your answer it appears that the only way to control the size (in
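For what it's worth, with a Kafka source the usual knob for bounding how much data each micro-batch reads is maxOffsetsPerTrigger; a sketch, with the broker address and topic name made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BoundedBatches").getOrCreate()

# Cap each trigger at roughly 10,000 Kafka offsets (value chosen arbitrarily).
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "input-topic")                # hypothetical topic
    .option("maxOffsetsPerTrigger", 10000)
    .load())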
On Wed, Jan 3, 2018 at 8:18 PM, John Zhuge wrote:
> Something like:
>
> Note: When running Spark on YARN, environment variables for the executors
> need to be set using the spark.yarn.executorEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file or
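For illustration, with a made-up variable name MY_ENV_VAR, the conf file entry and the equivalent spark-submit flag would look like:

# conf/spark-defaults.conf
spark.yarn.executorEnv.MY_ENV_VAR  some_value

# or on the command line (application name is made up)
spark-submit --conf "spark.yarn.executorEnv.MY_ENV_VAR=some_value" your_app.py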
I am running a Structured Streaming job (Spark 2.2.0) on EMR 5.9. The
job sources data from a Kafka topic, performs a variety of filters and
transformations, and sinks the data back into a different Kafka topic.
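In outline, the job looks roughly like the following (the broker address, topic names, and the filter are placeholders, not the real job):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("KafkaToKafka").getOrCreate()

# Source: read from one Kafka topic.
source = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .load())

# Placeholder filter/transformation.
filtered = source.select("key", "value").where(col("value").isNotNull())

# Sink: write to a different Kafka topic; the checkpoint tracks progress across restarts.
query = (filtered.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-to-kafka")
    .start())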
Once per day, we stop the query in order to merge the namenode edit logs
with the