Dear List,

I've observed what looks like a memory leak when using pyspark to run
~100 jobs in local mode.  Each job is essentially a create RDD ->
create DF -> write DF flow.  The RDD and DFs go out of scope after
each job completes, which is why I'm calling this a "memory leak."
Here's pseudocode:

```
# Build 100 RDDs of 1000 small rows each
row_rdds = []
for i in range(100):
  row_rdd = spark.sparkContext.parallelize([{'a': j} for j in range(1000)])
  row_rdds.append(row_rdd)

# One Spark job per RDD: create a DataFrame, count it, write it out
for row_rdd in row_rdds:
  df = spark.createDataFrame(row_rdd)
  df.persist()
  print(df.count())
  df.write.save(...)  # Save parquet
  df.unpersist()

  # Does not help:
  # del df
  # del row_rdd
```
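
For completeness, the spark session object above comes from a
local-mode setup along these lines (a minimal sketch; the core count
and the 8g driver-memory value here are illustrative, not copied from
my real config):

```
from pyspark.sql import SparkSession

# Minimal local-mode session; the master string and driver memory
# below are illustrative values, not the exact real settings.
spark = (
    SparkSession.builder
    .master('local[12]')
    .config('spark.driver.memory', '8g')
    .appName('rdd-to-parquet-jobs')
    .getOrCreate()
)
```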

In my real application:
 * rows are much larger, perhaps 1MB each
 * row_rdds are sized to fit available RAM

I observe that after 100 or so iterations of the second loop (each of
which creates a "job" in the Spark WebUI), the following happens (a
rough monitoring sketch follows the list):
 * the pyspark worker processes have fairly stable resident and
   virtual RAM usage
 * the java process eventually approaches the resident RAM cap (8GB
   standard), but its virtual RAM usage keeps ballooning
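
For reference, this is a sketch of the kind of per-iteration check
that shows the pattern (using psutil; matching processes by name and
command line like this is approximate):

```
import psutil

# Print resident (rss) and virtual (vms) memory for the driver JVM
# and the pyspark worker daemons; call once per loop iteration.
def log_memory():
    for proc in psutil.process_iter(['name', 'cmdline', 'memory_info']):
        try:
            cmd = ' '.join(proc.info['cmdline'] or [])
            if proc.info['name'] == 'java' or 'pyspark.daemon' in cmd:
                mem = proc.info['memory_info']
                print('%s rss=%.1fGB vms=%.1fGB'
                      % (proc.info['name'], mem.rss / 1e9, mem.vms / 1e9))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
```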

Eventually the machine runs out of RAM and the Linux OOM killer kills
the java process, resulting in an "IndexError: pop from an empty
deque" error from py4j/java_gateway.py.


Does anybody have any ideas about what's going on?  Note that this is
local mode.  I have personally run standalone masters and submitted a
ton of jobs over time and never seen anything like this.  Those were
very different jobs, but perhaps this issue is specific to local mode?

Emphasis: I did try to del the pyspark objects and run the Python GC.
That didn't help at all.
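
Concretely, that cleanup attempt looked roughly like this at the end
of each iteration of the second loop:

```
import gc

# End of each iteration of the second loop:
df.unpersist()
del df
del row_rdd
gc.collect()  # explicit Python garbage collection; no effect on the leak
```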

pyspark 2.4.4 on Java 1.8 on Ubuntu Bionic (TensorFlow docker image)

12-core i7 with 16GB of RAM and a 22GB swap file (swap is *on*).

Cheers,
-Paul

