Re: partitionBy causing OOM

2017-09-26 Thread Amit Sela
Thanks for all the answers! It looks like increasing the heap a little, and setting spark.sql.shuffle.partitions to a much lower number (I used the recommended input_size_mb/128 formula) did the trick. As for partitionBy, unless I use repartition("dt") before the writer, it actually writes more
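
A minimal PySpark sketch of the fix described above, assuming a date column named "dt"; the paths are hypothetical and input_size_mb is a placeholder you would measure yourself:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

    # Heuristic from the thread: roughly one shuffle partition per 128 MB
    # of input. input_size_mb is an assumed placeholder value.
    input_size_mb = 2048
    spark.conf.set("spark.sql.shuffle.partitions", max(1, input_size_mb // 128))

    df = spark.read.json("/path/to/input")  # hypothetical path

    # Repartitioning by the partition column first means each task holds
    # rows for only one (or a few) dates, so the partitioned writer keeps
    # fewer open files per task and produces fewer small output files.
    (df.repartition("dt")
       .write
       .partitionBy("dt")
       .mode("overwrite")
       .json("/path/to/output"))  # hypothetical path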

Re: partitionBy causing OOM

2017-09-25 Thread ayan guha
Another possible option would be creating a partitioned table in Hive and using dynamic partitioning while inserting. This will not require Spark to do an explicit partitionBy. On Tue, 26 Sep 2017 at 12:39 pm, Ankur Srivastava <ankur.srivast...@gmail.com> wrote: > Hi Amit, > > Spark keeps the
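
A rough sketch of the Hive dynamic-partitioning approach described above, assuming a SparkSession built with enableHiveSupport(), a DataFrame df holding the flattened data, and hypothetical table and column names:

    # Allow Hive to create partitions on the fly from the inserted data.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Hypothetical partitioned table; real schema would match your data.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS events_by_dt (payload STRING)
      PARTITIONED BY (dt STRING)
      STORED AS PARQUET
    """)

    df.createOrReplaceTempView("staged_events")

    # Hive routes each row to its dt partition; no explicit partitionBy
    # on the Spark writer.
    spark.sql("""
      INSERT OVERWRITE TABLE events_by_dt PARTITION (dt)
      SELECT payload, dt FROM staged_events
    """)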

Re: partitionBy causing OOM

2017-09-25 Thread Ankur Srivastava
Hi Amit, Spark keeps the partition that it is working on in memory (and does not spill to disk even if it is running OOM). Also, since you are getting OOM when using partitionBy (and not when you just use flatMap), there should be one (or a few) dates on which your partition size is bigger than the
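
One way to test this hypothesis is to count rows per date and look for a dominant key; a quick sketch, assuming the partition column is "dt" and df is the flattened DataFrame:

    # Row count per date is a rough proxy for partition size; a heavily
    # skewed date at the top of this list supports the OOM explanation.
    (df.groupBy("dt")
       .count()
       .orderBy("count", ascending=False)
       .show(20, truncate=False))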

Re: partitionBy causing OOM

2017-09-25 Thread 孫澤恩
Hi Amit, maybe you can change the configuration spark.sql.shuffle.partitions. The default is 200; changing this property changes the number of tasks when you are using the DataFrame API. > On 26 Sep 2017, at 1:25 AM, Amit Sela wrote: > > I'm trying to run a simple pyspark
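
For example (the value 50 is illustrative, not a recommendation):

    # Overrides the default of 200 shuffle partitions. This only affects
    # DataFrame/SQL shuffles, not RDD operations.
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    # Or at submit time:
    #   spark-submit --conf spark.sql.shuffle.partitions=50 app.py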

partitionBy causing OOM

2017-09-25 Thread Amit Sela
I'm trying to run a simple pyspark application that reads from a file (JSON), flattens it (explode), and writes back to a file (JSON) partitioned by date using DataFrameWriter.partitionBy(*cols). I keep getting OOMEs like: java.lang.OutOfMemoryError: Java heap space at
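
A sketch of the kind of job described, with hypothetical paths and column names ("records", "dt") standing in for the real schema:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("explode-and-partition").getOrCreate()

    df = spark.read.json("/path/to/input.json")  # hypothetical path

    # Flatten: one output row per element of the (assumed) array column.
    flat = df.select(col("dt"), explode(col("records")).alias("record"))

    # The step that runs out of heap: without a prior repartition, each
    # task may keep an open writer for every date value it encounters.
    flat.write.partitionBy("dt").json("/path/to/output")  # hypothetical path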