Hello,
I've been trying to run an iterative Spark job that spills 1+ GB to disk
per iteration on a system with limited disk space. I believe there would
be enough space if Spark cleaned up unused data from previous iterations,
but as it stands the number of iterations I can run is limited by the
available disk space.
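For context, the job looks roughly like this (heavily simplified; the
dataset and the update step here are placeholders, but the shape is the
same):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

    object IterativeJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-job"))

        // Placeholder state; the real dataset is much larger.
        var state = sc.parallelize(1 to 1000000).map(x => (x % 1000, x.toLong))

        for (i <- 1 to 100) {
          // Each pass shuffles, and the shuffle files written by earlier
          // iterations stay on disk even though they're never read again.
          state = state.map { case (k, v) => (k, v + 1) }.reduceByKey(_ + _)
          state.count() // force materialization of this iteration
        }
        sc.stop()
      }
    }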
I found a thread discussing spark.cleaner.ttl on the old Spark Users
Google group here:
https://groups.google.com/forum/#!topic/spark-users/9ebKcNCDih4
I think this setting may be what I'm looking for; however, the cleaner
seems to delete data that's still in use. The effect is that I get
bizarre exceptions from Spark complaining about missing broadcast data,
or ArrayIndexOutOfBoundsException errors. When is spark.cleaner.ttl safe
to use? Is it supposed to delete in-use data, or is this a
bug/shortcoming?
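For reference, this is roughly how I'm enabling it (assuming Spark 0.9+
where SparkConf exists; the 3600-second TTL is just a value I picked,
not a recommendation):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("iterative-job")
      .set("spark.cleaner.ttl", "3600") // TTL in seconds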
Cheers,
Michael