Dear Apache Spark community,
My Spark driver crashes, and the logs do not give enough explanation of why it
happens:
INFO [2019-10-21 16:33:37,045] ({pool-6-thread-7}
SchedulerFactory.java[jobStarted]:109) - Job 20190926-163704_913596201 started
by scheduler interpreter_2100843352
DEBUG
Honestly, I'd recommend you spend your time looking into the issue by
taking memory dumps at some interval and comparing the differences (and at
least share these dump files with the community, redacted if necessary).
Otherwise someone has to try to reproduce the issue without a reproducer,
and may not even be able to reproduce it.
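For the JVM side of the driver, the dumps would typically come from `jmap` or `jcmd`. For the Python side, a minimal sketch of the snapshot-and-compare idea using the standard-library `tracemalloc` module might look like this (the `snapshot_and_compare` helper is hypothetical, not part of any Spark API):

```python
import tracemalloc

# Hypothetical helper: snapshot the Python heap and, if an earlier
# snapshot exists, print the largest allocation differences since then.
def snapshot_and_compare(previous=None, top=5):
    snap = tracemalloc.take_snapshot()
    if previous is not None:
        for stat in snap.compare_to(previous, "lineno")[:top]:
            print(stat)
    return snap

tracemalloc.start()
first = snapshot_and_compare()
# ... run one batch of jobs here, then compare ...
blobs = [bytearray(1024) for _ in range(1000)]  # stand-in for real work
second = snapshot_and_compare(first)
```

Taking a snapshot after every N jobs and diffing consecutive snapshots shows which allocation sites keep growing; note this only covers Python-heap allocations, not the JVM heap where most Spark driver state lives.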
Dear List,
I've observed some sort of memory leak when using pyspark to run ~100
jobs in local mode. Each job is essentially a create RDD -> create DF
-> write DF sort of flow. The RDDs and DFs go out of scope after each
job completes, yet memory is not released, hence I call this issue a
"memory leak." Here's pseudocode: