Thanks for all the answers!
It looks like increasing the heap a little and setting spark.sql.shuffle.partitions
to a much lower number (I used the recommended input_size_mb/128 formula) did the
trick.
As for partitionBy, unless I use repartition("dt") before the writer, it
actually writes more th
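For anyone hitting the same thing later, here's a rough sketch of the version that
works for me now (the paths, the session setup, and the "events" column are
placeholders, not the real job):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Much lower than the default 200; value picked from the input_size_mb / 128 rule of thumb.
spark.conf.set("spark.sql.shuffle.partitions", "32")  # illustrative value

df = spark.read.json("/path/to/input")  # placeholder path
flat = df.withColumn("event", explode("events")).drop("events")  # "events" is a placeholder column

# repartition("dt") first, so each write task only holds rows for a single date.
flat.repartition("dt").write.partitionBy("dt").json("/path/to/output")  # placeholder path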
Another possible option would be creating a partitioned table in Hive and using
dynamic partitioning while inserting. This would not require Spark to do an
explicit partitionBy.
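Something along these lines (table and column names here are made up, and Hive
support needs to be enabled on the session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Let Hive create the date partitions on the fly.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS events_by_dt (event STRING)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")

# "flat" stands in for the flattened DataFrame from the original job;
# the partition column (dt) has to come last in the SELECT.
flat = spark.read.json("/path/to/flattened")  # placeholder
flat.createOrReplaceTempView("flat_events")
spark.sql("INSERT INTO TABLE events_by_dt PARTITION (dt) SELECT event, dt FROM flat_events")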
On Tue, 26 Sep 2017 at 12:39 pm, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:
Hi Amit,
Spark keeps the partition that it is working on in memory (and does not
spill to disk even if it is running OOM). Also, since you are getting OOM
when using partitionBy (and not when you just use flatMap), there is likely
one (or a few) dates on which your partition size is bigger than the heap.
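A quick way to check is to count rows per date on the flattened DataFrame before
the write, something like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
flat = spark.read.json("/path/to/flattened")  # stand-in for the flattened DataFrame

# Rows per date, largest first; one huge date points at the partition
# that does not fit in the executor heap.
flat.groupBy("dt").count().orderBy("count", ascending=False).show(20, truncate=False)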
Hi, Amit,
Maybe you can change the configuration spark.sql.shuffle.partitions. The default
is 200; changing this property changes the number of tasks when you are using the
DataFrame API.
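For example (the value here is only an illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Can also be passed at submit time with --conf spark.sql.shuffle.partitions=50.
spark.conf.set("spark.sql.shuffle.partitions", "50")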
> On 26 Sep 2017, at 1:25 AM, Amit Sela wrote:
I'm trying to run a simple pyspark application that reads from a file (json),
flattens it (explode), and writes back to a file (json) partitioned by date
using DataFrameWriter.partitionBy(*cols).
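In code, the job boils down to roughly this (paths and column names here are
placeholders, not the real ones):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/path/to/input")  # placeholder path
flat = df.withColumn("event", explode("events")).drop("events")  # "events" is a placeholder column
flat.write.partitionBy("dt").json("/path/to/output")  # "dt" stands in for the date column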
I keep getting OOMEs like:
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.util.collection.