Greetings, I am processing files in batches and have structured an iterative process around them. Each batch is processed by first loading the data with spark-csv, performing some minor transformations, and then writing it back out as Parquet. Absolutely no caching or shuffling should occur anywhere in this process. A minimal sketch of the loop is below (paths and the transformation are just placeholders; the real job differs only in details).
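```python
# Minimal sketch of the per-batch loop, assuming hypothetical input/output
# paths and a trivial column cast as the "minor transformation".
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="batch-csv-to-parquet")
sqlContext = SQLContext(sc)

for batch_path in ["/data/in/batch_%03d" % i for i in range(100)]:
    # Load the batch with spark-csv (external package on Spark 1.5.x).
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load(batch_path))

    # Minor, narrow transformation only; no cache()/persist() and nothing
    # that should trigger a shuffle.
    df = df.withColumn("value", F.col("value").cast("double"))

    # Write straight back out as Parquet.
    df.write.parquet(batch_path.replace("/in/", "/out/"))
```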
I watch memory utilization on each executor and notice a steady increase with each iteration that completes. Eventually we reach the memory limit set for the executor and the process slowly degrades and fails. I'm really unclear about what I could be doing that causes the executors to hold on to state between iterations; again, I was careful to make sure no caching occurred. I've done most of my testing to date in Python, though I will port it to Scala to see whether the behavior is isolated to that runtime. Spark: 1.5.2 ~~ Ajaxx