Subject: Re: Long-running job OOMs driver process
+1 for using S3A.

It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use GZ files, but those cannot be split into partitions, thus putting a lot of pressure on the driver. It doesn't look like this…
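To make the format point concrete, here is a minimal sketch (not code from the thread; the bucket and paths are invented): a gzipped file cannot be split, so each .gz file is read by a single task, while Parquet output is splittable and columnar.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-check").getOrCreate()

    # A .gz file cannot be split, so each one must be read end-to-end by a
    # single task; thousands of small gzipped inputs become thousands of
    # tiny single-file reads for the driver to schedule and track.
    gz_df = spark.read.text("s3a://my-bucket/incoming/*.gz")

    # Parquet output is splittable and columnar, so downstream jobs can
    # parallelise their reads regardless of how the text inputs were packaged.
    gz_df.write.mode("overwrite").parquet("s3a://my-bucket/staging/events/")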
+1 to not threading.

What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, this could be an issue.

If the above two things don't fix your OOM issue, without knowing anything else about your job, I would focus on your caching strategy as a first step.
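A rough sketch of the "N RDDs vs. 1" point (again not from the thread; paths and bucket are invented). The commented-out variant is the shape to avoid:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("single-dataframe").getOrCreate()

    paths = [
        "s3a://my-bucket/incoming/2016-11-18/part-0001.txt",
        "s3a://my-bucket/incoming/2016-11-18/part-0002.txt",
        # ... thousands more
    ]

    # Shape to avoid: one cached DataFrame per file leaves the driver tracking
    # thousands of separate plans and cached block sets.
    # dfs = [spark.read.text(p).cache() for p in paths]

    # Shape to prefer: a single DataFrame over all the paths, cached at most once.
    df = spark.read.text(paths)   # the reader accepts a list of paths
    df.cache()                    # only worthwhile if more than one action reuses it
    df.write.parquet("s3a://my-bucket/parquet/2016-11-18/")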
On 18 Nov 2016, at 14:31, Keith Bourgoin wrote:

> We thread the file processing to amortize the cost of things like getting
> files from S3.

Define cost here: actual $ amount, or merely time to read the data?

If it's read times, you should really be…
Hi Alexis,

Thanks for the response. I've been working with Irina on trying to sort this issue out.

We thread the file processing to amortize the cost of things like getting files from S3. It's a pattern we've seen recommended in many places, but I don't have any of those links handy.
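For readers following along, this is roughly the kind of driver-side threading being described (a sketch only; the function name, paths and pool size are invented, and the thread itself is debating whether the pattern is a good idea):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("threaded-ingest").getOrCreate()

    def convert_file(path):
        # Each call submits its own Spark job from a driver thread; the threads
        # overlap S3 listing/setup latency, but the jobs they launch still share
        # the same executors, cores and memory.
        df = spark.read.text(path)
        df.write.mode("append").parquet("s3a://my-bucket/parquet/out/")

    paths = ["s3a://my-bucket/incoming/a.txt",
             "s3a://my-bucket/incoming/b.txt"]  # ... thousands more per day

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(convert_file, paths))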
Hi Irina,

I would question the use of multiple threads in your application. Since Spark is going to run the processing of each DataFrame on all the cores of your cluster, the processes will be competing for resources. In fact, they would not only compete for CPU cores but also for memory.
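To illustrate the point about cores (a sketch, not code from the original mail; paths are invented): a single action already fans out over every core the application owns, so extra driver threads add competing jobs rather than extra capacity.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallelism-check").getOrCreate()

    # Roughly the total number of executor cores available to this application.
    print(spark.sparkContext.defaultParallelism)

    # A single write already schedules one task per partition across all of
    # those cores; a second job launched from another driver thread has to
    # compete for the same cores and executor memory.
    df = spark.read.text("s3a://my-bucket/incoming/*.txt")
    df.write.parquet("s3a://my-bucket/parquet/out/")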
We have an application that reads text files, converts them to DataFrames, and saves them in Parquet format. The application runs fine when processing a few files, but we have several thousand produced every day. When running the job over all of the files, spark-submit gets killed with an out-of-memory error.
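For context, the job being described boils down to something like the following (a sketch only; the real parsing and conversion logic is not shown in the thread, and the paths are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

    # Read the day's text files, turn them into a DataFrame, save as Parquet.
    df = spark.read.text("s3a://my-bucket/incoming/2016-11-18/*.txt")
    df.write.mode("append").parquet("s3a://my-bucket/parquet/events/")

Driver heap is fixed at submit time (spark-submit's --driver-memory flag, or spark.driver.memory in spark-defaults), so it can be raised, but the replies above focus more on reducing the pressure on the driver (file format, caching strategy, threading) than on simply giving it more memory.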