In general, Java processes fail with an OutOfMemoryError when your code and
data do not fit into the memory allocated to the runtime. In Spark, that
memory is controlled through the --executor-memory flag.
If you are running Spark on YARN, then the YARN configuration will dictate
the maximum memory available to each container.
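For example, if you want to set it programmatically instead of on the
spark-submit command line, something like the following should work (the 4g
value and the app name are just placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to passing --executor-memory 4g to spark-submit.
// spark.executor.memory has to be set before the SparkContext is created,
// since executors are sized when they are launched.
val conf = new SparkConf()
  .setAppName("memory-example")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)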
This article recommends setting spark.locality.wait to 10 milliseconds when
using Spark Streaming, and explains why they chose that value. If you are
using batch Spark, that value should still be a good starting point:
https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-prod
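For reference, applying that recommendation looks roughly like this (the
"10ms" value comes from the article; treat it as a starting point, not a
rule):

import org.apache.spark.SparkConf

// Default is 3s; a long locality wait can stall short micro-batches,
// so the article drops it to 10 milliseconds for Spark Streaming.
val conf = new SparkConf()
  .set("spark.locality.wait", "10ms")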
I just stumbled upon this issue as well in Spark 1.6.2 when trying to write
my own custom Sink. For anyone else who runs into it, there are two relevant
JIRAs that I found, but no solution as of yet:
- https://issues.apache.org/jira/browse/SPARK-14151 - Propose to refactor
and expose Metrics Sink and Source interface
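In case it helps, the workaround I have seen suggested (since the Sink trait
is still private[spark] in 1.6.x) is to declare the custom sink inside an
org.apache.spark package so the trait is visible. A rough sketch, assuming
the reflective constructor signature Spark 1.6 expects (Properties,
MetricRegistry, SecurityManager); the class and object names here are just
illustrative:

package org.apache.spark.metrics.sink

import java.util.Properties

import com.codahale.metrics.MetricRegistry
import org.apache.spark.SecurityManager

// Hypothetical custom sink; living in this package is only a workaround
// for the private[spark] visibility of the Sink trait.
class MyConsoleSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  override def start(): Unit = { /* start your reporter here */ }
  override def stop(): Unit = { /* stop the reporter */ }
  override def report(): Unit = { /* flush metrics on demand */ }
}

It then gets wired up through metrics.properties, e.g.
*.sink.myconsole.class=org.apache.spark.metrics.sink.MyConsoleSink.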
Please take a look at this gist as a simplified example of a thread-unsafe
operation being passed to map():
https://gist.github.com/matthew-dailey/4e1ab0aac580151dcfd7fbe6beab84dc
This is for Spark Streaming, but I suspect the answer is the same between
batch and streaming.
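To illustrate the kind of pattern I mean, here is a condensed sketch (not
the exact code in the gist): a SimpleDateFormat held in a singleton object
is shared by every task thread in an executor JVM, and SimpleDateFormat is
not thread-safe.

import java.text.SimpleDateFormat

import org.apache.spark.{SparkConf, SparkContext}

// Lives in a singleton object, so all concurrently running tasks in the
// same executor JVM share this one instance.
object Formats {
  val df = new SimpleDateFormat("yyyy-MM-dd")
}

object ThreadUnsafeMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unsafe-map"))
    val parsed = sc.parallelize(Seq("2016-01-01", "2016-01-02"))
      // Concurrent tasks can corrupt df's internal state mid-parse.
      .map(s => Formats.df.parse(s))
    parsed.collect().foreach(println)
    sc.stop()
  }
}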
Thanks for any help,
Matt