Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
eu...@ipponusa.com>
> *Sent:* Friday, November 18, 2016 10:17 AM
> *To:* Nathan Lande
> *Cc:* Keith Bourgoin; Irina Truong; u...@spark.incubator.apache.org
> *Subject:* Re: Long-running job OOMs driver process
>
> +1 for using S3A.
>
> It would also depend on what format you're us…

Re: Long-running job OOMs driver process

2016-11-18 Thread Yong Zhang
…ong; u...@spark.incubator.apache.org
Subject: Re: Long-running job OOMs driver process

+1 for using S3A. It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use GZ files but they cannot be partitioned…

Re: Long-running job OOMs driver process

2016-11-18 Thread Alexis Seigneurin
+1 for using S3A. It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use GZ files but they cannot be partitioned, thus putting a lot of pressure on the driver. It doesn't look like this…

Re: Long-running job OOMs driver process

2016-11-18 Thread Nathan Lande
+1 to not threading. What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, this could be an issue. If the above two things don't fix your OOM issue, without knowing anything else about your job, I would focus on your caching strategy as a…

Re: Long-running job OOMs driver process

2016-11-18 Thread Steve Loughran
On 18 Nov 2016, at 14:31, Keith Bourgoin wrote:

> We thread the file processing to amortize the cost of things like getting files from S3.

Define cost here: actual $ amount, or merely time to read the data? If it's read times, you should really be…

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
Hi Alexis, Thanks for the response. I've been working with Irina on trying to sort this issue out. We thread the file processing to amortize the cost of things like getting files from S3. It's a pattern we've seen recommended in many places, but I don't have any of those links handy. The…

Re: Long-running job OOMs driver process

2016-11-17 Thread Alexis Seigneurin
Hi Irina, I would question the use of multiple threads in your application. Since Spark is going to run the processing of each DataFrame on all the cores of your cluster, the processes will be competing for resources. In fact, they would not only compete for CPU cores but also for memory. Spark…

Long-running job OOMs driver process

2016-11-17 Thread Irina Truong
We have an application that reads text files, converts them to dataframes, and saves them in Parquet format. The application runs fine when processing a few files, but we have several thousand produced every day. When running the job for all files, we have spark-submit killed on OOM: # #…
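Driver OOMs of the shape reported here are sometimes mitigated (though not root-caused) by giving the driver more headroom at submit time. A hedged sketch of the relevant `spark-submit` options follows; the sizes and the application file name are purely illustrative, and `spark.driver.maxResultSize` only matters if large results are being collected to the driver.

```shell
spark-submit \
  --driver-memory 8g \
  --conf spark.driver.maxResultSize=2g \
  app.py
```

If the driver's footprint grows without bound over thousands of files, raising these limits only delays the OOM; the caching and file-format advice elsewhere in this thread addresses the underlying growth.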