Re: Trouble with large Yarn job

2015-01-14 Thread Anders Arpteg
Interesting, sounds plausible. Another way to avoid the problem has been to cache intermediate output for large jobs (i.e. split large jobs into smaller ones and then union the results together). Unfortunate that this type of tweaking should be necessary, though; hopefully it will be better in 1.2.1. On Tue, Jan 13, 2015 at
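The split-and-union workaround described above might be sketched roughly like this (a sketch against the Spark RDD API; the paths, chunk size, and use of textFile are assumptions, not the actual job from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: process the input in smaller chunks, cache each chunk's
// intermediate output, then union the pieces at the end, so no single
// job has to handle everything at once.
def processInChunks(sc: SparkContext, inputPaths: Seq[String], chunkSize: Int): RDD[String] = {
  val partials = inputPaths.grouped(chunkSize).toSeq.map { chunk =>
    val part = sc.textFile(chunk.mkString(","))  // one smaller job per chunk
    part.persist(StorageLevel.MEMORY_AND_DISK)   // cache intermediate output
    part.count()                                 // force evaluation of this chunk
    part
  }
  sc.union(partials)                             // stitch the pieces back together
}
```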

Re: Trouble with large Yarn job

2015-01-12 Thread Sven Krasser
Anders, This could be related to this open ticket: https://issues.apache.org/jira/browse/SPARK-5077. A call to coalesce() also fixed that for us as a stopgap. Best, -Sven On Mon, Jan 12, 2015 at 10:18 AM, Anders Arpteg arp...@spotify.com wrote: Yes sure Sandy, I've checked the logs and it's
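The coalesce() stopgap Sven mentions amounts to capping the partition count before the expensive stage; a minimal sketch (paths and the target count of 2000 are hypothetical):

```scala
import org.apache.spark.SparkContext

// Sketch only: coalesce() shrinks the number of partitions without a
// full shuffle, which served as a stopgap for the linked ticket.
def saveCompacted(sc: SparkContext, in: String, out: String): Unit = {
  val data = sc.textFile(in)   // may carry a very large partition count
  data.coalesce(2000)          // stopgap: cap partitions, no shuffle
      .saveAsTextFile(out)
}
```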

Re: Trouble with large Yarn job

2015-01-12 Thread Anders Arpteg
Yes sure Sandy, I've checked the logs and it's not an OOM issue. I've actually been able to solve the problem finally, and it seems to be an issue with too many partitions. The repartitioning I tried initially was done after the union, and by then it's too late. By repartitioning as early as possible,
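The fix described here, repartitioning each input before the union rather than the combined result after it, might look like this (paths and partition counts are hypothetical, not the actual code from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: repartition each source early, per input. Repartitioning
// the already-unioned RDD comes too late, because the union has already
// inherited every upstream partition by then.
def loadUnioned(sc: SparkContext, paths: Seq[String]): RDD[String] = {
  val inputs = paths.map(p => sc.textFile(p).repartition(100)) // early, per input
  sc.union(inputs)
  // Too late (the pattern that failed):
  // sc.union(paths.map(p => sc.textFile(p))).repartition(100)
}
```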

Re: Trouble with large Yarn job

2015-01-11 Thread Sandy Ryza
Hi Anders, Have you checked your NodeManager logs to make sure YARN isn't killing executors for exceeding memory limits? -Sandy On Tue, Jan 6, 2015 at 8:20 AM, Anders Arpteg arp...@spotify.com wrote: Hey, I have a job that keeps failing if too much data is processed, and I can't see how to
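The NodeManager kill that Sandy asks about typically logs a line containing "running beyond physical memory limits" (exact wording varies across Hadoop versions). A trivial pure-Scala sketch of scanning log lines for it, with fabricated example lines:

```scala
// Marker for the YARN memory-limit kill message; wording is an
// approximation and varies by Hadoop version.
val killMarker = "running beyond physical memory limits"

def memoryKills(logLines: Seq[String]): Seq[String] =
  logLines.filter(_.contains(killMarker))

// Fabricated example log lines, for illustration only:
val sample = Seq(
  "2015-01-06 08:21:03 INFO  ContainerLaunch: launching container_01",
  "2015-01-06 08:40:11 WARN  ContainersMonitorImpl: Container [pid=4242] is running beyond physical memory limits. Killing container."
)
// memoryKills(sample) keeps only the WARN line
```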

Trouble with large Yarn job

2015-01-07 Thread Anders Arpteg
Hey, I have a job that keeps failing if too much data is processed, and I can't see how to get it working. I've tried repartitioning with more partitions and increasing the amount of memory for the executors (now about 12G and 400 executors). Here is a snippet of the first part of the code, which
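For reference, the executor sizing mentioned above would correspond to settings along these lines (a spark-defaults.conf sketch using the numbers from the email, not the actual submit command; the overhead line is an additional knob relevant when YARN kills containers, with the property name as of Spark 1.x):

```
spark.executor.memory                12g
spark.executor.instances             400
# Extra off-heap headroom (MB); YARN kills containers that exceed
# executor memory plus this overhead:
spark.yarn.executor.memoryOverhead   1024
```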
